douyi8732 2014-09-05 22:38
Accepted

PHP and XML: How to remove all whitespace outside "terminal elements"

First let's define "terminal element" (for the particular purpose of this question).

By "terminal element" I mean an element that contains no other elements.

Element reference: http://www.w3schools.com/xml/xml_elements.asp

How can I remove from an XML document/node all whitespace (line feeds, carriage returns, tabs and spaces) that is outside "terminal elements" with PHP?

Rules: Only PHP native XML parsers (no regex).

2 answers

  • dqu92800 2015-03-06 18:19

    All whitespace outside "terminal elements" (leaf element nodes) lives in text-nodes, because all text in a document is held in text-nodes. So if you select every text-node that is not a child of a terminal element, you can strip the whitespace characters from those nodes. That is already the whole answer.

    Let's start lightly by just removing whitespace from one text-node in an XML Document.

    PHP's XML parsers exchange strings as UTF-8 (I use DOMDocument in this example), so preg_replace with the u modifier is handy here: it understands both UTF-8 and which characters count as whitespace:

    /** @var DOMText $text */
    $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
    

    This removes all whitespace from a text-node. Here is a demonstration of that:

    $doc = new DOMDocument();
    $doc->loadXML('<root> Very Simple Demo </root>');
    
    /** @var DOMText $text */
    $text = $doc->documentElement->childNodes->item(0);
    
    $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
    
    $doc->save('php://output');
    

    Output:

    <?xml version="1.0"?>
    <root>VerySimpleDemo</root>
    

    As you can see, the space characters are removed from the one and only text-node in that document.
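
    The question rules out regular expressions. Here is a minimal sketch of the same single-node demo without preg_replace. It assumes that "whitespace" means only the four characters XML 1.0 defines as whitespace (space, tab, carriage return, line feed); the variable name $xmlWhitespace is made up for illustration:

    ```php
    <?php
    // Assumption: only the four XML 1.0 whitespace characters are stripped.
    $xmlWhitespace = [' ', "\t", "\r", "\n"];

    $doc = new DOMDocument();
    $doc->loadXML('<root> Very Simple Demo </root>');

    /** @var DOMText $text */
    $text = $doc->documentElement->firstChild;

    // str_replace() works on bytes. That is safe here because these four
    // ASCII bytes never occur inside a multi-byte UTF-8 sequence.
    $text->nodeValue = str_replace($xmlWhitespace, '', $text->nodeValue);

    $xml = $doc->saveXML($doc->documentElement);
    echo $xml, "\n";
    ```

    This variant is narrower than the `\s` pattern with the u modifier, which may also match other Unicode space characters. Which of the two is wanted depends on what the document should treat as whitespace.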

    With a larger document and your "terminal elements", this is naturally more interesting, but it works pretty much the same. The only difference is that you need to get all text-nodes that are not children of leaf element nodes. This is best done with an xpath query:

    //*[*]/text()
    

    This reads: all text-nodes that are children of elements that contain other elements. Let's use the following XML (file content.xml) as an example:

    <?xml version="1.0"?>
    <content>
        <parent>
            <child id="1">
                <title>child 1</title>
    
                <child id="1">
                    <title>
                        child 1.1 with whitespace
                    </title>
                </child>
            </child>
        </parent>
    </content>
    

    It contains both leaf elements and elements that have child elements. It also shows the whitespace quite well, since it is used for element indentation.

    After loading it:

    $file = __DIR__ . '/content.xml';
    
    $doc = new DOMDocument();
    $doc->load($file);
    

    A DOMXPath is necessary to execute the xpath-query:

    $xp    = new DOMXPath($doc);
    $texts = $xp->query('//*[*]/text()');
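
    To see which nodes that query picks up, here is a small sketch on a made-up inline document (the element names a, b, c and d are illustrative only and do not come from content.xml):

    ```php
    <?php
    $doc = new DOMDocument();
    // Hypothetical sample: <b> and <d> are leaf elements, <a> and <c> are not.
    $doc->loadXML('<a> <b>x y</b> <c><d> z </d></c> </a>');

    $xp    = new DOMXPath($doc);
    $texts = $xp->query('//*[*]/text()');

    // Collect the parent element name of every selected text-node.
    $parents = [];
    foreach ($texts as $text) {
        $parents[] = $text->parentNode->nodeName;
    }

    echo count($parents), ' text-nodes, parents: ', implode(',', $parents), "\n";
    ```

    Only the three whitespace text-nodes directly under <a> are selected. The text inside the leaf elements <b> and <d> is not part of the result, and <c> has no text-node children at all.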
    

    What's left is to iterate over all those text-nodes and apply the whitespace removal as above:

    foreach ($texts as $text) {
        /** @var DOMText $text */
        $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
    }
    

    The result then is:

    <?xml version="1.0"?>
    <content><parent><child id="1"><title>child 1</title><child id="1"><title>
                        child 1.1 with whitespace
                    </title></child></child></parent></content>
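
    The steps above can be wrapped into one function. This is a sketch only: the function name stripWhitespaceOutsideLeaves and the inline sample document are assumptions made for illustration:

    ```php
    <?php
    /**
     * Strip whitespace from every text-node whose parent element has
     * element children. Hypothetical helper, not part of the DOM extension.
     */
    function stripWhitespaceOutsideLeaves(DOMDocument $doc)
    {
        $xp = new DOMXPath($doc);
        foreach ($xp->query('//*[*]/text()') as $text) {
            /** @var DOMText $text */
            $text->nodeValue = preg_replace('~\s+~u', '', $text->nodeValue);
        }
    }

    $doc = new DOMDocument();
    $doc->loadXML("<content>\n  <parent>\n    <title> keep me </title>\n  </parent>\n</content>");
    stripWhitespaceOutsideLeaves($doc);

    // Serialize only the root element, without the XML declaration.
    $xml = $doc->saveXML($doc->documentElement);
    echo $xml, "\n";
    ```

    The indentation between the elements disappears, while the spaces inside the leaf element <title> stay as they are.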
    

    This should answer the question. But it wouldn't be XML if there weren't a little more verbosity, or a little "but...".

    Note that "text()" in xpath matches all kinds of text-nodes, including CDATA sections. If a CDATA section consists of whitespace only, the code above renders an empty CDATA section ("<![CDATA[]]>") into the output. One way to deal with that is to remove the emptied nodes from the document:

    /** @var DOMText $text */
    $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
    if (!$text->length) {
        $text->parentNode->removeChild($text);
    }
    

    This removes all emptied text-nodes from the document and keeps the document tree tidy. Hope this helps.
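
    For completeness, here is a sketch of the whole loop with the removal step in place, run on a made-up inline document that has a whitespace-only CDATA section next to leaf elements (the element names root and leaf are assumptions for illustration):

    ```php
    <?php
    $doc = new DOMDocument();
    $doc->loadXML(
        '<root><![CDATA[   ]]><leaf> a b </leaf> <leaf><![CDATA[ c ]]></leaf></root>'
    );

    $xp = new DOMXPath($doc);
    foreach ($xp->query('//*[*]/text()') as $text) {
        /** @var DOMText $text plain text-node or CDATA section */
        $text->nodeValue = preg_replace('~\s+~u', '', $text->nodeValue);

        // Drop nodes that ended up empty, so no "<![CDATA[]]>" is written.
        if ($text->length === 0) {
            $text->parentNode->removeChild($text);
        }
    }

    $xml = $doc->saveXML($doc->documentElement);
    echo $xml, "\n";
    ```

    The whitespace-only CDATA section and the space between the two <leaf> elements are removed from the tree. The content of both leaf elements, including the CDATA section inside the second one, is left untouched.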

    This answer was selected by the asker as the best answer.