douxia9826 2013-05-21 16:56
Views: 58
Accepted

PHP: DOMDocument: removing unwanted text from nested elements

I have the following xml document:

<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
    <li>Bulleted style text
        <ul>
            <li>
                <paragraph>1.Sub Bulleted style text</paragraph>
            </li>
        </ul>
    </li>
</ul>
<ul>
    <li>Bulleted style text <strong>bold</strong>
        <ul>
            <li>
                <paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
            </li>
        </ul>
    </li>
</ul>

I need to remove the numbers preceding the sub-bulleted text ("1." and "2." in the given example).

This is the code I have so far:

<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';

    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

    protected $dom;

    public function processListsText( $loop = null ){

        $this->dom = new DomDocument('1.0', 'UTF-8');

        $this->dom->loadXML($this->xml_string);

        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }

        foreach($li_set as $li){

            //check for child nodes
            if(! $li->hasChildNodes() ){
                continue;
            }

            foreach($li->childNodes as $child){
                if( $child->hasChildNodes() ){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText( $child->childNodes );
                }
                if( ! ( $child instanceof DOMElement ) ){
                    continue;
                }
                if( ( $child->localName != 'paragraph') ||  ( $child instanceof DOMText )){
                    continue;
                }
                if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
                    continue;
                }

                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);

                //set node to empty
                $child->nodeValue = '';

                //add updated content to node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));

                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);

            }
        }
    }
}

$importer = new MyDocumentImporter();
$importer->processListsText();

The issue I can see is that $child->textContent returns the plain text content of the node, and strips the additional child tags. So:

<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>

becomes

<paragraph>Sub Bulleted bold</paragraph>

The <strong> tag is no more.
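
A minimal reproduction of that behaviour, as a sketch only: the variable names ($dom, $para, $flat, $types) are chosen for this illustration and are not part of the importer class above.

```php
<?php
// Illustrative snippet; $dom and $para are names chosen for this example only.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML('<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>');
$para = $dom->documentElement;

// textContent concatenates the text of all descendants and drops the markup.
$flat = $para->textContent;
echo $flat, PHP_EOL;

// The child nodes are still separate: a text node followed by a <strong> element.
$types = [];
foreach ($para->childNodes as $child) {
    $types[] = $child->nodeName;
}
echo implode(', ', $types), PHP_EOL;
```

The flattened string no longer carries the <strong> element, so writing it back into the node replaces the element with plain text.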

I'm a little stumped... Can anyone see a way to strip the unwanted characters, and retain the "inner child" <strong> tag?

The tag may not always be <strong>, it could also be a hyperlink <a href="#">, or <emphasize>.


2 answers

  • douxin2011 2013-05-21 17:16

    Assuming your XML actually parses, you could use XPath to make your queries a lot easier:

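    $xp = new DOMXPath($this->dom);

    foreach ($xp->query('//li/paragraph') as $para) {
        $first = $para->firstChild;
        // Only touch a leading text node; child elements such as <strong> stay intact.
        if ($first instanceof DOMText) {
            // The dot is escaped so it matches a literal "." rather than any character.
            $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
        }
    }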
    

    It runs the replacement on the paragraph's first text node only, rather than on the whole element's contents, so child elements such as <strong> are left in place.
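
    For anyone who wants to try this outside the importer class, here is a self-contained sketch of the same idea. The function name stripLeadingNumbers and the shortened sample XML are assumptions made for this example, not part of the original code; the dot in the pattern is escaped so that it only matches a literal ".".

```php
<?php
// Hypothetical helper (not from the original code): strips a leading "N." marker
// from the first text node of every <li>/<paragraph>, keeping child elements.
function stripLeadingNumbers(string $xml): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadXML($xml);

    $xp = new DOMXPath($dom);
    foreach ($xp->query('//li/paragraph') as $para) {
        $first = $para->firstChild;
        // Skip paragraphs that are empty or that start with an element.
        if ($first instanceof DOMText) {
            $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
        }
    }

    // Serialize only the root element to leave out the XML declaration.
    return $dom->saveXML($dom->documentElement);
}

// Shortened version of the sample document, wrapped in a single root element.
$xml = '<some_tag><ul><li>Bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></some_tag>';

$result = stripLeadingNumbers($xml);
echo $result, PHP_EOL;
```

    Because only the first text node is edited, any inline element that follows it (<strong>, <a href="#">, <emphasize>) passes through untouched. A paragraph whose number sits inside a child element would not be handled by this sketch.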

    Accepted by the asker as the best answer.