douan8473 2013-06-26 13:43 Acceptance rate: 100%

DOM in PHP: Decoded entities and setting nodeValue

I want to manipulate an XML document with PHP, using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. A quick example illustrates what bothers me.

Suppose we have the following code:

$doc = new DOMDocument();
$doc->loadXML(<XML data>);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);

foreach($node_list as $node) {
    //do something
}

If the code in the loop is something like

$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);

it works fine. But if it's more like

$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;

and $text contains a decoded &, the ampersand is not re-encoded on assignment, even if nothing is done with $text at all.

At the moment, I apply htmlspecialchars to $text before assigning it to $node->nodeValue. Now I want to know:

  1. if that is sufficient,
  2. if not, what would suffice,
  3. and if there are more elegant solutions for this, as in the case of attribute manipulation.

The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.


EDIT

It turned out that my original question had the wrong scope; sorry for that. Here is an example in which the described behaviour actually occurs.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
$doc->loadXML($output);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');

foreach($node_list as $node) {
    $node->nodeValue = $node->textContent;
}
echo $doc->saveXML();

If I execute this code on the CLI with

php beeb.php |egrep 'link|Warning'

I get results like

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>

which should be

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>

(and is, if the loop is omitted), together with corresponding warnings such as

Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15

When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.
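
The workaround can be checked offline with a minimal sketch. The one-item feed and its URL below are made up for illustration, and the flags passed to htmlspecialchars (ENT_XML1 | ENT_NOQUOTES) are one possible choice, not something the feed requires.

```php
<?php
// Made-up one-item feed; the URL is illustrative only.
$doc = new DOMDocument();
$doc->loadXML('<rss><item><link>http://example.com/story?a=1&amp;b=2</link></item></rss>');

$link = $doc->getElementsByTagName('link')->item(0);

// textContent returns the decoded string, i.e. with a bare "&".
$original = $link->textContent;

// Re-encode markup characters before assigning to the element's nodeValue,
// because the assignment parses entity references in the new value.
$link->nodeValue = htmlspecialchars($original, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');

// Reading the text back should give the same decoded string as before.
$roundTripped = $link->textContent;
echo $doc->saveXML($link), "\n";
```

Comparing $roundTripped with $original is a quick way to confirm that nothing was truncated at the ampersand.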


  • dongyuli4538 2013-08-15 14:03

    As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText and DOMElement differ in this regard. To illustrate this, an example:

    $doc = new DOMDocument();
    $doc->formatOutput = true;
    $doc->loadXML('<root/>');
    
    $s = 'text &amp;&lt;<"\'&text;&text';
    
    $root = $doc->documentElement;
    
    $node = $doc->createElement('tag1', $s); #line 10
    $root->appendChild($node);
    
    $node = $doc->createElement('tag2');
    $text = $doc->createTextNode($s);
    $node->appendChild($text);
    $root->appendChild($node);
    
    $node = $doc->createElement('tag3');
    $text = $doc->createCDATASection($s);
    $node->appendChild($text);
    $root->appendChild($node);
    
    echo $doc->saveXML();
    

    outputs

    Warning: DOMDocument::createElement(): unterminated entity reference            text in /tmp/DOMtest.php on line 10
    <?xml version="1.0"?>
    <root>
      <tag1>text &amp;&lt;&lt;"'&text;</tag1>
      <tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
      <tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
    </root>
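
    The same difference can be seen when assigning nodeValue on an existing DOMText node. The following is a minimal sketch with a made-up document; the element name item and the appended string are assumptions chosen for illustration.

```php
<?php
// Made-up document: one element whose only child is a text node.
$doc = new DOMDocument();
$doc->loadXML('<root><item>a &amp; b</item></root>');

$item     = $doc->documentElement->firstChild; // the <item> DOMElement
$textNode = $item->firstChild;                 // its DOMText child

// Assigning to a DOMText node stores the string as literal character data,
// so a bare "&" is escaped on serialization rather than parsed as an entity.
$textNode->nodeValue = $textNode->nodeValue . ' & c';

$serialized = $doc->saveXML($item);
echo $serialized, "\n"; // <item>a &amp; b &amp; c</item>
```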
    

    In this particular case, the appropriate fix is to alter the nodeValue of the DOMText nodes rather than that of the enclosing elements. Combining hakre's two answers gives a fairly elegant solution:

    $doc = new DOMDocument();
    $doc->loadXML(<XML data>);
    
    $xpath     = new DOMXPath($doc);
    $node_list = $xpath->query(<some XPath>);
    
    $visitTextNode = function (DOMText $node) {
        $text = $node->textContent;
        /*
            do something with $text
        */
        $node->nodeValue = $text;
    };
    
    foreach ($node_list as $node) {
        if ($node->nodeType == XML_TEXT_NODE) {
            $visitTextNode($node);
        } else {
            foreach ($node->childNodes as $child) {
                if ($child->nodeType == XML_TEXT_NODE) {
                    $visitTextNode($child);
                }
            }
        }
    }
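
    For a concrete run of the same idea, here is a sketch under stated assumptions: the helper name rewriteText, the sample document, the XPath expression and the appended query parameter are all hypothetical and only serve to make the snippet executable.

```php
<?php
// Hypothetical helper: applies $transform to every text node that is either
// matched directly or is a direct child of a matched element.
function rewriteText(DOMNodeList $nodes, callable $transform): void
{
    foreach ($nodes as $node) {
        // Take a snapshot of the candidates before modifying anything.
        $targets = $node->nodeType === XML_TEXT_NODE
            ? [$node]
            : iterator_to_array($node->childNodes);

        foreach ($targets as $target) {
            if ($target->nodeType === XML_TEXT_NODE) {
                // Setting nodeValue on a DOMText node keeps "&" as literal text.
                $target->nodeValue = $transform($target->nodeValue);
            }
        }
    }
}

// Made-up feed fragment; the URL is illustrative only.
$doc = new DOMDocument();
$doc->loadXML('<rss><item><link>http://example.com/?a=1&amp;b=2</link></item></rss>');

$xpath = new DOMXPath($doc);
rewriteText($xpath->query('//item/link'), function ($text) {
    return $text . '&c=3'; // append a parameter containing a bare "&"
});

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
// <rss><item><link>http://example.com/?a=1&amp;b=2&amp;c=3</link></item></rss>
```

    Passing the transformation as a callable keeps the traversal separate from whatever is done with the text.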
    