dongmi4809
2014-06-01 15:12
104 views

Parsing an XML feed with CDATA using PHP SimpleXML [duplicate]

This question already has an answer here: How to parse CDATA HTML content of XML using SimpleXML? (2 answers)

I am converting an RSS feed to JSON using PHP, with the code below.

My JSON output contains the description data from each item element, but the title and link data are not extracted.

  • The problem is either incorrect CDATA in the feed or my code not parsing it correctly.

The XML is here:

$blog_url = 'http://www.blogdogarotinho.com/rssfeedgenerator.ashx';

$rawFeed = file_get_contents($blog_url);
$xml=simplexml_load_string($rawFeed,'SimpleXMLElement', LIBXML_NOCDATA);

// step 2: extract the channel metadata
$articles = array();    

// step 3: extract the articles

foreach ($xml->channel->item as $item) {
    $article = array();

    $article['title'] = (string)trim($item->title);
    $article['link'] = $item->link;      
    $article['pubDate'] = $item->pubDate;
    $article['timestamp'] = strtotime($item->pubDate);
    $article['description'] = (string)trim($item->description);
    $article['isPermaLink'] = $item->guid['isPermaLink'];        

    $articles[$article['timestamp']] = $article;
}

echo json_encode($articles);


1 answer

  • dplsnw7329 2014-06-01 18:28
    Accepted answer

    I think you are just the victim of the browser hiding the tags. Let me explain: your input feed doesn't really have <![CDATA[ ]]> sections in it. The < and > characters are entity-encoded in the raw source of the RSS stream. Hit Ctrl+U on the RSS link in your browser and you will see:

    <?xml version="1.0" encoding="utf-16"?>
    <rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
      <channel>
        <description>Blog do Garotinho</description>
        <item>
          <description>&lt;![CDATA[&lt;br&gt;
              Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
          </description>
          <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
    ...
          <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
        </item>
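
A quick way to confirm this from PHP, rather than from the browser's source view, is to look for the entity-encoded marker in the raw text. This is only a minimal sketch: `$rawFeed` here is a hypothetical one-item excerpt standing in for the downloaded feed, and the two flag variables are illustrative names.

```php
<?php
// Hypothetical excerpt standing in for the raw feed text; in the question
// this string would come from file_get_contents($blog_url).
$rawFeed = '<item><title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title></item>';

// Entity-encoded markers are plain text to an XML parser.
$hasEncodedMarkers = strpos($rawFeed, '&lt;![CDATA[') !== false;

// A real CDATA section would contain the literal, unencoded marker.
$hasRealCdata = strpos($rawFeed, '<![CDATA[') !== false;

var_dump($hasEncodedMarkers, $hasRealCdata);
```

If the first flag is true and the second is false, LIBXML_NOCDATA has nothing to merge, because the document contains no actual CDATA sections.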
    

    As you can see, the <title>, for example, starts with &lt;, which turns into < when SimpleXML returns it for your JSON data. If you then look at the printed JSON in a browser, the browser sees the following:

    "title":"<![CDATA[A bancada dos caras de pau]]>"
    

    This will not be rendered, because the browser treats it as a tag. The description seems to show up because it contains a <br> tag at some point, which ends the first "tag", so you can see the rest of the output.

    If you hit Ctrl+U on your output page, you should see the output printed as expected (I used a command-line PHP script myself and did not notice this at first).
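
To see the effect without the browser hiding anything, one option is to escape the JSON before printing it into an HTML page. The following is a sketch under stated assumptions: `$sample` is a hypothetical one-item feed mirroring the structure shown earlier, and `$visible` is an illustrative variable name.

```php
<?php
// Hypothetical inline sample mirroring the feed: the CDATA markers are
// entity-encoded, so they are ordinary text, not a real CDATA section.
$sample = '<rss version="2.0"><channel><item>'
        . '<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>'
        . '</item></channel></rss>';

$xml   = simplexml_load_string($sample, 'SimpleXMLElement', LIBXML_NOCDATA);
$title = (string) $xml->channel->item->title;

// The parser decodes &lt; and &gt;, so the markers come back as literal text.
$json = json_encode(array('title' => $title));

// Escape before printing into an HTML page, so the browser displays the
// markers instead of treating them as an unknown tag.
$visible = htmlspecialchars($json, ENT_QUOTES, 'UTF-8');
echo $visible, "\n";
```

Sending the JSON with a `Content-Type: application/json` header instead of the default HTML content type is another way to keep the browser from interpreting the output as markup.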

    Try this demo:

    You could get rid of these markers by stripping them after parsing with a simple preg_replace():

    function clean_cdata($str) {
        return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
    }
    

    This should take care of the CDATA markers if they are at the start or the end of an element's text. You can call it inside the foreach() loop like this:

    // ....
    $article['title'] = clean_cdata($item->title);
    // ....
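
Put together, the loop might look like the sketch below. It repeats clean_cdata() from above so that it runs on its own, and it uses a hypothetical inline `$sample` in place of the live feed. Trimming the description is an extra step added here for illustration, not part of the original suggestion.

```php
<?php
// Helper from the answer, repeated so this example is self-contained.
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

// Hypothetical sample with entity-encoded CDATA markers, as in the feed.
$sample = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
      <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
      <description>&lt;![CDATA[&lt;br&gt;
          Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
      </description>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($sample, 'SimpleXMLElement', LIBXML_NOCDATA);

$articles = array();
foreach ($xml->channel->item as $item) {
    // clean_cdata() casts each element to a string before stripping markers,
    // so json_encode() receives plain strings rather than SimpleXMLElement objects.
    $articles[] = array(
        'title'       => clean_cdata($item->title),
        'link'        => clean_cdata($item->link),
        'description' => trim(clean_cdata($item->description)),
    );
}

echo json_encode($articles), "\n";
```

Because the values are plain strings with the markers removed, the title and link no longer look like a tag to the browser.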
    