dongmi4809 2014-06-01 15:12

Parsing an XML feed with CDATA using PHP SimpleXML [duplicate]


I am parsing an RSS feed into JSON using PHP, with the code below.

My JSON output contains the description data from each item element, but the title and link data are not being extracted.

  • The problem is either an incorrect CDATA section somewhere in the feed, or my code is not parsing it correctly.

The XML is here:

$blog_url = 'http://www.blogdogarotinho.com/rssfeedgenerator.ashx';

// step 1: fetch and parse the feed
$rawFeed = file_get_contents($blog_url);
$xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);

// step 2: prepare the result array
$articles = array();

// step 3: extract the articles
foreach ($xml->channel->item as $item) {
    $article = array();

    // cast every SimpleXMLElement to a string so json_encode() emits plain values
    $article['title'] = trim((string)$item->title);
    $article['link'] = (string)$item->link;
    $article['pubDate'] = (string)$item->pubDate;
    $article['timestamp'] = strtotime((string)$item->pubDate);
    $article['description'] = trim((string)$item->description);
    $article['isPermaLink'] = (string)$item->guid['isPermaLink'];

    // note: items sharing a timestamp overwrite each other
    $articles[$article['timestamp']] = $article;
}

echo json_encode($articles);

1 answer

  • dplsnw7329 2014-06-01 18:28

    I think you are just the victim of the browser hiding the tags. Let me explain: your input feed doesn't really have <![CDATA[ ]]> sections in it; the < and > characters are entity-encoded in the raw source of the RSS stream. Press Ctrl+U on the RSS link in your browser and you will see:

    <?xml version="1.0" encoding="utf-16"?>
    <rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
      <channel>
        <description>Blog do Garotinho</description>
        <item>
          <description>&lt;![CDATA[&lt;br&gt;
              Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
          </description>
          <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
    ...
          <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
        </item>
    

    As you can see, the <title>, for example, starts with &lt;, which turns into a literal < when SimpleXML returns the text for your JSON data. If you then look at the printed JSON in a browser, the browser receives the following:

    "title":"<![CDATA[A bancada dos caras de pau]]>"
    

    This will not be rendered, because the browser treats it as a tag. The description seems to show up because it contains a <br> tag at some point, which ends the first "tag", so you can see the rest of the output.

    If you press Ctrl+U on the page with your JSON output, you should see it printed as expected (I used a command-line PHP script myself and did not notice this at first).
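
    The effect can be reproduced without the live feed. The following is a minimal sketch, assuming a small inline sample that imitates the entity-encoded markers shown above; the $sample, $title, $json and $visible names are made up for illustration. It shows that the parser hands back the markers as literal text, and that escaping the JSON for HTML makes them visible in a browser.

```php
<?php
// Hypothetical inline sample imitating the feed's entity-encoded markers.
$sample = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($sample, 'SimpleXMLElement', LIBXML_NOCDATA);

// The parser decodes &lt; and &gt;, so the markers arrive as ordinary text.
$title = (string) $xml->channel->item[0]->title;

$json = json_encode(array('title' => $title));

// Escaping for HTML lets a browser display the markers instead of
// interpreting them as the start of a tag.
$visible = htmlspecialchars($json, ENT_QUOTES, 'UTF-8');

echo $visible, PHP_EOL;
```

    When the script's only job is to serve JSON, sending a header('Content-Type: application/json') before any output is another option, since the browser then has no reason to interpret the response as HTML.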

    You could get rid of these markers by stripping them after parsing with a simple preg_replace():

    function clean_cdata($str) {
        return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
    }
    

    This should take care of the CDATA markers if they are at the start or the end of an individual element's text. You can call it inside the foreach() loop like this:

    // ....
    $article['title'] = clean_cdata($item->title);
    // ....
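
    Put together, the whole thing might look like the sketch below. It runs against a hypothetical inline sample rather than the live feed, so the $sample string and the choice to collect only title and link are assumptions made for illustration; clean_cdata() is the function given above.

```php
<?php
// clean_cdata() as given in the answer above.
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

// Hypothetical inline sample imitating two elements of one feed item.
$sample = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
      <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($sample, 'SimpleXMLElement', LIBXML_NOCDATA);

$articles = array();
foreach ($xml->channel->item as $item) {
    // Strip the leading and trailing markers from each field.
    $articles[] = array(
        'title' => clean_cdata($item->title),
        'link'  => clean_cdata($item->link),
    );
}

echo json_encode($articles), PHP_EOL;
```

    Note that json_encode() escapes forward slashes by default, so the link is printed with \/ sequences; passing JSON_UNESCAPED_SLASHES as the second argument avoids that if it matters.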
    
    This answer was selected as the best answer by the asker.