dongmi4809 2014-06-01 15:12

Parsing an XML feed with CDATA using PHP SimpleXML [duplicate]

I am parsing an RSS feed into JSON using PHP, with the code below.

My JSON output contains the description data from each item element, but the title and link data are not extracted.

  • The problem is either incorrect CDATA in the feed or my code not parsing it correctly.

The XML feed is at the URL used in the code:

$blog_url = 'http://www.blogdogarotinho.com/rssfeedgenerator.ashx';

// step 1: fetch and parse the feed
$rawFeed = file_get_contents($blog_url);
$xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);

// step 2: prepare the result array
$articles = array();

// step 3: extract the articles
foreach ($xml->channel->item as $item) {
    $article = array();

    // cast every node to string so json_encode() emits plain strings, not objects
    $article['title'] = trim((string)$item->title);
    $article['link'] = (string)$item->link;
    $article['pubDate'] = (string)$item->pubDate;
    $article['timestamp'] = strtotime((string)$item->pubDate);
    $article['description'] = trim((string)$item->description);
    $article['isPermaLink'] = (string)$item->guid['isPermaLink'];

    // items sharing a timestamp overwrite each other
    $articles[$article['timestamp']] = $article;
}

echo json_encode($articles);

1 answer

  • dplsnw7329 2014-06-01 18:28

    I think you are just the victim of the browser hiding the tags. Let me explain: your input feed doesn't really have <![CDATA[ ]]> sections in it. The < and > characters are actually entity-encoded in the raw source of the RSS stream. Hit <kbd>Ctrl</kbd>+<kbd>U</kbd> on the RSS link in your browser and you will see:

    <?xml version="1.0" encoding="utf-16"?>
    <rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
      <channel>
        <description>Blog do Garotinho</description>
        <item>
          <description>&lt;![CDATA[&lt;br&gt;
              Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
          </description>
          <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
    ...
          <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
        </item>
    

    As you can see, the <title>, for example, starts with a &lt;, which turns into a < when SimpleXML returns the text for your JSON data. If you look at the printed JSON in a browser, the browser receives the following:

    "title":"<![CDATA[A bancada dos caras de pau]]>"
    

    This will not be rendered, because the browser treats it as a tag. The description seems to show up because it contains a <br> tag at some point, whose > ends the first "tag", so you can see the rest of the output.

    If you hit <kbd>Ctrl</kbd>+<kbd>U</kbd> on the output page, you should see the output printed as expected (I used a command-line PHP script myself and did not notice this at first).
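
To make the effect above concrete, here is a minimal sketch. It assumes a hypothetical one-item feed built inline (not the real feed) whose title carries the same entity-encoded markers. It shows that the parser decodes them into literal text, and that escaping the JSON with htmlspecialchars() keeps them visible if the page is rendered as HTML.

```php
<?php
// Hypothetical one-item feed; the CDATA markers are entity-encoded,
// as in the raw source of the real feed.
$raw = '<rss version="2.0"><channel><item>'
     . '<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>'
     . '</item></channel></rss>';

$xml = simplexml_load_string($raw, 'SimpleXMLElement', LIBXML_NOCDATA);

// The parser decodes &lt; and &gt;, so the markers become literal text.
$title = (string) $xml->channel->item->title;

$json = json_encode(array('title' => $title));

// Plain output: a browser rendering this as HTML treats "<![CDATA[...]]>" as markup.
echo $json, "\n";

// Escaped output: the markers stay visible when rendered as HTML.
echo htmlspecialchars($json), "\n";
```

Sending the response with a JSON or plain-text content type instead of HTML should have a similar effect, since the browser then has no reason to interpret the output as markup.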

    You could get rid of these markers by stripping them after parsing with a simple preg_replace():

    function clean_cdata($str) {
        return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
    }
    

    This should take care of the CDATA markers if they are at the start or the end of an individual element's text. You can call this inside the foreach() loop like this:

    // ....
    $article['title'] = clean_cdata($item->title);
    // ....
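
Put together, the cleanup might look like the following sketch. It repeats clean_cdata() so that it runs on its own, and it uses a hypothetical inline feed with the title and link taken from the excerpt above rather than fetching the real URL.

```php
<?php
// Same helper as above: strips a leading "<![CDATA[" and a trailing "]]>".
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

// Hypothetical inline feed standing in for file_get_contents($blog_url).
$raw = '<rss version="2.0"><channel><item>'
     . '<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>'
     . '<link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>'
     . '</item></channel></rss>';

$xml = simplexml_load_string($raw, 'SimpleXMLElement', LIBXML_NOCDATA);

$articles = array();
foreach ($xml->channel->item as $item) {
    $articles[] = array(
        'title' => clean_cdata($item->title),
        'link'  => clean_cdata($item->link),
    );
}

echo json_encode($articles), "\n";
```

The same call can wrap description and any other field that carries the encoded markers.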
    
    This answer was selected by the asker as the best answer.
