douan8473 2013-06-26 13:43 Acceptance rate: 100%

DOM in PHP: Decoded entities and setting nodeValue

I want to manipulate an XML document with PHP using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. To illustrate what bothers me, here is a quick example.

Suppose we have the following code

$doc = new DOMDocument();
$doc->loadXML($xml);    // $xml: string containing the XML data

$xpath = new DOMXPath($doc);
$node_list = $xpath->query($query);    // $query: some XPath expression

foreach($node_list as $node) {
    //do something
}

If the code in the loop is something like

$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);

it works fine. But if it's more like

$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;

and $text contains a decoded ampersand (&), the ampersand is not encoded again on assignment, even if one does nothing with $text at all.

At the moment, I apply htmlspecialchars to $text before I assign it to $node->nodeValue. Now I want to know

  1. if that is sufficient,
  2. if not, what would suffice,
  3. and if there are more elegant solutions for this, as in the case of attribute manipulation.

The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.


EDIT

It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
$doc->loadXML($output);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');

foreach($node_list as $node) {
    $node->nodeValue = $node->textContent;
}
echo $doc->saveXML();

If I execute this code on the CLI with

php beeb.php |egrep 'link|Warning'

I get results like

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>

which should be

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>

(and is, if the loop is omitted), together with corresponding warnings such as

Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15

When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.
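
The behaviour can also be looked at without a network request. The following is only a minimal sketch, assuming a made-up one-item feed: the URL and the names $xml, $link and $decoded are illustrative and not taken from the BBC feed. It shows that textContent returns the decoded string, and that escaping it with htmlspecialchars before assigning it to nodeValue round-trips the value. Passing ENT_XML1 | ENT_NOQUOTES is an assumption made here so that only XML's predefined entities are produced.

```php
<?php
// Hypothetical one-item feed; the ampersand in the URL is escaped as &amp;.
$xml = '<rss><channel><item>'
     . '<link>http://example.com/news?a=1&amp;b=2</link>'
     . '</item></channel></rss>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$link  = $xpath->query('//item/link')->item(0);

// textContent yields the decoded value, i.e. it contains a bare "&".
$decoded = $link->textContent;

// Escape again before assigning to the element's nodeValue, since that
// setter appears to interpret entity references in its input.
$link->nodeValue = htmlspecialchars($decoded, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');

// Serialize only the <link> element.
echo $doc->saveXML($link), "\n";
```

With this input the serialized element should again contain a=1&amp;b=2. Whether this escaping is sufficient for every feed is exactly the open question above.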


2 answers

  • dongyuli4538 2013-08-15 14:03

    As hakre explained, the problem is that in PHP's DOM library the behaviour of setting nodeValue with respect to entities depends on the class of the node; in particular, DOMText and DOMElement differ in this regard. An example to illustrate this:

    $doc = new DOMDocument();
    $doc->formatOutput = true;
    $doc->loadXML('<root/>');
    
    $s = 'text &amp;&lt;<"\'&text;&text';
    
    $root = $doc->documentElement;
    
    $node = $doc->createElement('tag1', $s); #line 10
    $root->appendChild($node);
    
    $node = $doc->createElement('tag2');
    $text = $doc->createTextNode($s);
    $node->appendChild($text);
    $root->appendChild($node);
    
    $node = $doc->createElement('tag3');
    $text = $doc->createCDATASection($s);
    $node->appendChild($text);
    $root->appendChild($node);
    
    echo $doc->saveXML();
    

    outputs

    Warning: DOMDocument::createElement(): unterminated entity reference            text in /tmp/DOMtest.php on line 10
    <?xml version="1.0"?>
    <root>
      <tag1>text &amp;&lt;&lt;"'&text;</tag1>
      <tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
      <tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
    </root>
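
    To check what the text node and CDATA variants actually store, one can read the values back. This is a minimal sketch that reuses the string $s from the example; the names $viaText and $viaCdata are illustrative. It suggests that both node types keep the string verbatim and that escaping only happens on serialization.

    ```php
    <?php
    $doc = new DOMDocument();
    $doc->loadXML('<root/>');

    $s = 'text &amp;&lt;<"\'&text;&text';

    // Text node: the string is taken literally, nothing is parsed as an entity.
    $viaText = $doc->createElement('tag2');
    $viaText->appendChild($doc->createTextNode($s));
    $doc->documentElement->appendChild($viaText);

    // CDATA section: likewise stored literally.
    $viaCdata = $doc->createElement('tag3');
    $viaCdata->appendChild($doc->createCDATASection($s));
    $doc->documentElement->appendChild($viaCdata);

    // Reading back returns the original string in both cases.
    var_dump($viaText->textContent === $s);
    var_dump($viaCdata->textContent === $s);

    // Overwriting the nodeValue of the DOMText child is literal as well.
    $viaText->firstChild->nodeValue = 'a & b';
    echo $doc->saveXML($viaText), "\n";
    ```

    Both comparisons should print bool(true), and the last line should show the ampersand escaped as &amp; in the serialized element.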
    

    In this particular case, it is appropriate to alter the nodeValue of the DOMText nodes instead of the elements. Combining hakre's two answers, one gets quite an elegant solution.

    $doc = new DOMDocument();
    $doc->loadXML($xml);    // $xml: string containing the XML data
    
    $xpath     = new DOMXPath($doc);
    $node_list = $xpath->query($query);    // $query: some XPath expression
    
    $visitTextNode = function (DOMText $node) {
        $text = $node->textContent;
        /*
            do something with $text
        */
        $node->nodeValue = $text;
    };
    
    foreach ($node_list as $node) {
        if ($node->nodeType == XML_TEXT_NODE) {
            $visitTextNode($node);
        } else {
            foreach ($node->childNodes as $child) {
                if ($child->nodeType == XML_TEXT_NODE) {
                    $visitTextNode($child);
                }
            }
        }
    }
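
    As a variation on the loop above, and only as a sketch: when the XPath expression can be adjusted, the text nodes can be selected directly with text(), which makes the nodeType checks unnecessary. The feed snippet, its URL and the appended parameter c=3 are made up for illustration.

    ```php
    <?php
    // Hypothetical feed snippet; the ampersand in the URL is escaped as &amp;.
    $xml = '<rss><channel><item>'
         . '<link>http://example.com/news?a=1&amp;b=2</link>'
         . '</item></channel></rss>';

    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    // Select the DOMText children of every <link> instead of the elements.
    foreach ($xpath->query('//item/link/text()') as $textNode) {
        $text  = $textNode->textContent;   // decoded: contains a bare "&"
        $text .= '&c=3';                   // example manipulation
        $textNode->nodeValue = $text;      // stored literally on a DOMText node
    }

    echo $doc->saveXML($doc->documentElement), "\n";
    ```

    No manual escaping is needed here; the ampersands should come out as &amp; when the document is serialized.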
    
    This answer was selected by the asker as the best answer.