douhui4831 2011-12-14 22:00

Combining XMLReader and SimpleXML, with conditions

I am using a combination of XMLReader and SimpleXML to parse the posts in a WordPress export file. I realize this is a little out of the norm, but it's more of a backup project, so that we can easily pull up one of these articles if we need it in the future. The WP site they were on needs to come down.

The issue I am having is that some of the nodes in the XML file are empty or contain useless values (i.e., not full posts). I need to add some string-length conditions, but I'm not sure how to check each one.

<?php 

$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';


$reader = new XMLReader();
$reader->open($path_to_xml_file);
while($reader->read())
{
    if($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item')
    {
        $doc = new DOMDocument('1.0', 'UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(),true));
        //echo $xml->title; //or whatever

        // Take care of the articles
        $newcontent = $xml->children('http://purl.org/rss/1.0/modules/content/');
        $contentString = $newcontent->encoded;
        $titleString = $xml->title;

        echo '
    <div class="article-container" id="article-' .  $xml->title . '">
    <a href="#top" class="top-link">Back to the Top</a>
        <h2>' .  $xml->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
    </div>';
    }
}

?>

I was able to do this check successfully with just SimpleXML, but it was too much of a memory hog by itself. This was my SimpleXML code:

<?php 

    $url = 'wordpress.2011.xml.gz';
    $xml = new SimpleXMLElement("compress.zlib://$url", NULL, TRUE);

    foreach ($xml->item as $item) :

    $newcontent = $item->children('http://purl.org/rss/1.0/modules/content/');

    ?>

<?php
$contentString = $newcontent->encoded;
$titleString = $item->title;

if ((strlen($contentString) < 13) || (strlen($titleString) < 5))  {
    echo '';
} else {
    echo '
    <div class="article-container" id="article-' .  $item->title . '">
    <a href="#top" class="top-link">Back to the Top</a>
        <h2>' .  $item->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
    </div>';
}
?>



 <?php endforeach; ?>

UPDATE

With Francis' help, it is working now. Here is the code:

<?php 

$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';

$reader = new XMLReader();
$reader->open($path_to_xml_file);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';
while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
        $doc = new DOMDocument('1.0','UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $titleString = (string) $xml->title;
        $contentString = (string) $xml->children($contentNS)->encoded;
        if (strlen($contentString) > 12 and strlen($titleString) > 4)  {
            // Be careful with your output escaping!
            // This below looks like it might be wrong:
            // - $titleString for an ID (use slug)
            // - $titleString not escaped
            // - $contentString should be escaped? not sure here.
            // Have you considered using XMLWriter()?
            echo '
<div class="article-container" id="article-' .  $titleString . '">
    <a href="#top" class="top-link">Back to the Top</a>
    <h2>' .  $titleString . '</h2>
    <div class="articles">' . $contentString . '</div>
</div>';
        } else {

        echo'';

        }

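        $reader->next(); // skip the subtree; we already expand()ed this item.
        // Caveat: next() lands on the following sibling node and the loop's
        // read() then advances once more. That is fine when whitespace
        // separates the items, but an item that follows with no whitespace
        // in between would be stepped over.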
    }
}

?>
1 answer

  • doujiaoang69440 2011-12-15 02:08

    When you write $contentString = $newcontent->encoded, the type of $contentString is not string but SimpleXMLElement. strlen() is then handed an object rather than a plain string, which is fragile and can give results you do not expect.

    You need to explicitly cast SimpleXMLElements to string to get the text value of the element:

    $contentString = (string) $newcontent->encoded;
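
To make the difference concrete, here is a minimal sketch, assuming a tiny hand-written item (the sample XML and the $sample, $isObject and $keep names are illustrative and not taken from the export file). It shows that the property access yields a SimpleXMLElement, and that the cast yields a plain string whose length can be tested with the thresholds from the question.

```php
<?php
// Hypothetical sample item; a real WordPress export item has many more fields.
$sample = '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
        . '<title>Hello world</title>'
        . '<content:encoded>&lt;p&gt;A full post body.&lt;/p&gt;</content:encoded>'
        . '</item>';

$item = new SimpleXMLElement($sample);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';

// Property access returns an object, not a string.
$encoded  = $item->children($contentNS)->encoded;
$isObject = $encoded instanceof SimpleXMLElement;

// The explicit cast extracts the element's text content.
$contentString = (string) $encoded;
$titleString   = (string) $item->title;

// Same thresholds as the condition in the question.
$keep = strlen($contentString) > 12 && strlen($titleString) > 4;
```

Casting once and reusing the string variables also keeps the later echo statement working on plain strings rather than on objects.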
    

    As an aside, you can simplify your DOM expansion and conversion to SimpleXMLElement by using the optional argument to XMLReader::expand():

    $sxe = simplexml_import_dom($reader->expand(new DOMDocument('1.0','UTF-8')));
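
As a rough illustration of that shortcut, the sketch below reads a small in-memory feed instead of the gzipped export. The $feed string and the $titles array are assumptions made up for this example. Because the sample has no whitespace between items, the loop simply lets read() walk through each expanded subtree rather than calling next().

```php
<?php
// Hypothetical two-item feed used only for illustration.
$feed = '<rss><channel>'
      . '<item><title>First post</title></item>'
      . '<item><title>Second post</title></item>'
      . '</channel></rss>';

$reader = new XMLReader();
$reader->XML($feed); // parse from a string instead of open()ing a file

$titles = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // Passing a DOMDocument makes expand() return a node that belongs
        // to that document, so no separate importNode() call is needed.
        $node = $reader->expand(new DOMDocument('1.0', 'UTF-8'));
        $sxe  = simplexml_import_dom($node);
        $titles[] = (string) $sxe->title;
    }
}
$reader->close();
```

Only one item is held as a DOM/SimpleXML tree at a time, which is the point of combining the two APIs for a large export.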
    

    EDIT: Here is a complete example of your first code block, rewritten to do what (I think) you want. As you can see, all I did was take the body of the loop from your second code example and put it inside the loop of your first.

    $reader = new XMLReader();
    $reader->open($path_to_xml_file);
    $contentNS = 'http://purl.org/rss/1.0/modules/content/';
    while($reader->read()) {
        if($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
            $xml = simplexml_import_dom($reader->expand(new DOMDocument('1.0', 'UTF-8')));
            $titleString = (string) $xml->title;
            $contentString = (string) $xml->children($contentNS)->encoded;
            if (strlen($contentString) > 12 and strlen($titleString) > 4)  {
                // Be careful with your output escaping!
                // This below looks like it might be wrong:
                // - $titleString for an ID (use slug)
                // - $titleString not escaped
                // - $contentString should be escaped? not sure here.
                // Have you considered using XMLWriter()?
                echo '
    <div class="article-container" id="article-' .  $titleString . '">
        <a href="#top" class="top-link">Back to the Top</a>
        <h2>' .  $titleString . '</h2>
        <div class="articles">' . $contentString . '</div>
    </div>';
            }
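            $reader->next(); // skip the subtree; we already expand()ed this item.
            // Caveat: next() lands on the following sibling node and the loop's
            // read() then advances once more. That is fine when whitespace
            // separates the items, but an item that follows with no whitespace
            // in between would be stepped over.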
        }
    }
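
The escaping concerns raised in the comments above could be handled along the following lines. This is only a sketch: slugify() and renderArticle() are hypothetical helpers invented for this example, and it assumes the content:encoded text is already trusted HTML from the export, which may not hold for your data.

```php
<?php
// Hypothetical helper: turn a title into a value usable as an HTML id.
function slugify(string $title): string
{
    $slug = strtolower($title);
    // Replace every run of characters other than a-z and 0-9 with one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

// Hypothetical helper: build the article markup for one item.
function renderArticle(string $titleString, string $contentString): string
{
    // Escape the title before placing it in element content.
    $safeTitle = htmlspecialchars($titleString, ENT_QUOTES, 'UTF-8');
    // Assumption: $contentString is trusted HTML, so it is emitted as-is.
    return '<div class="article-container" id="article-' . slugify($titleString) . '">' . "\n"
         . '    <a href="#top" class="top-link">Back to the Top</a>' . "\n"
         . '    <h2>' . $safeTitle . '</h2>' . "\n"
         . '    <div class="articles">' . $contentString . '</div>' . "\n"
         . '</div>';
}

$html = renderArticle('Tom & Jerry: "Best Of"', '<p>Full post body here.</p>');
```

Note that this simple slug keeps only ASCII letters and digits, so a title written entirely in another script would produce an empty slug, and two posts with the same title would share an id. A post ID or a counter would be a safer basis if either case can occur.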
    
    This answer was accepted by the asker as the best answer.
