duanpin2009 2019-05-05 10:13
Views: 62
Accepted

Using DOMXPath to clean up deprecated HTML code (converting nested <div> tags to <p> tags)

I'm trying to read rich text stored in an old MS Access database into a new PHP web app. The sanitised data will be displayed to users in CKEditor, which is quite strict about parsing standards-compliant HTML. However, the data stored in MS Access is often malformed or uses deprecated HTML.

Below is an example piece of data I am trying to sanitise:

<div align="right">Previous claim $ &nbsp;&nbsp;935.00<div align="right">&nbsp;&nbsp;This claim $1,572.50</div></div>

This data is meant to be two right-justified lines of text. However, MS Access has styled the <div> tags with the deprecated align attribute instead of a style attribute, and has nested them when, in this scenario, they should be sequential.

To turn this example data into two lines of text that are both right-justified and that CKEditor will read and display as intended (i.e. text appears as right justified), I am trying to replace the <div> tags with <p> tags, and inject an inline style attribute with right text-align to replace the deprecated align attribute.

I am using PHP's DOMXPath to clean up the data, with the following code:

$dom = new DOMDocument();
$dom->loadHTML($dataForCleaning, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@align]') as $node) {
    $alignment = $node->getAttribute('align');

    $newNode = $dom->createElement('p');
    $newNode->setAttribute("style", "text-align:".$alignment);
    $node->parentNode->insertBefore($newNode, $node);

    foreach ($node->childNodes as $child) {
        $newNode->appendChild($child);
    }

    $node->parentNode->removeChild($node);
}

I am using insertBefore rather than appendChild to keep the sequence of elements the same, but this appears to be what's causing the issue in this nested example.

For non-nested <div> tags, the sanitised output HTML is correct. However, in this nested <div> example, the output ends up being:

<p style="text-align:right">Previous claim $ &nbsp;&nbsp;935.00</p>

Note that the second line of text (This claim...) has been removed, as it was within a nested <div> that was a child of the parent <div>.

I don't mind if the resultant <p> tags remain nested, as CKEditor ends up cleaning these up, but I do need to make sure I'm not losing data like this current code does.
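
A likely cause of the lost text, sketched under stated assumptions: $node->childNodes is a live list, and appendChild moves a node rather than copying it, so the inner foreach above appears to stop after the first child. The nested <div> then stays inside $node and is deleted by removeChild. The minimal sketch below copies the child list into a plain array with iterator_to_array before moving anything. The sample markup and the variable names ($children, $movedCount, $leftBehind) are illustrative only and do not come from the code above.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<div>first<span>second</span></div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$div = $dom->documentElement;

// Snapshot the children first, so that moving them cannot disturb the loop.
$children = iterator_to_array($div->childNodes);

$p = $dom->createElement('p');
foreach ($children as $child) {
    // appendChild moves the node out of $div and into $p.
    $p->appendChild($child);
}

$movedCount = $p->childNodes->length;   // children that arrived in $p
$leftBehind = $div->childNodes->length; // children still inside $div
```

With the snapshot in place, both the text node and the nested element end up in the new <p>.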

Thanks in advance for any help and guidance. -Mark

1 answer

  • duanli8391 2019-05-05 11:12

    There are a couple of things I've changed. First, rather than appending the existing node, I clone it and append the copy (in $newNode->appendChild($child->cloneNode(true));), because appending the original moves it out of $node while the loop is still iterating over $node->childNodes. Second, as you are moving the enclosed node, I think the XPath result no longer points to this moved node. So instead, when copying the child nodes I check for the same pattern of a nested <div> with an align attribute (as in your sample) and, if I find one, I create a new node in the new format and add that instead:

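    foreach ($xpath->query('//div[@align]') as $node) {
        $alignment = $node->getAttribute('align');

        // Create the replacement <p> and place it just before the old <div>
        $newNode = $dom->createElement('p');
        $newNode->setAttribute("style", "text-align:".$alignment);
        $node->parentNode->insertBefore($newNode, $node);

        foreach ($node->childNodes as $child) {
            // hasAttribute() avoids an error when a nested <div> has no align attribute
            if ( $child instanceof DOMElement && $child->localName == "div"
                    && $child->hasAttribute("align") )    {
                // Rebuild the nested <div> as a <p> that keeps its own alignment;
                // createTextNode() escapes characters such as & safely
                $subNode = $dom->createElement('p');
                $subNode->appendChild($dom->createTextNode($child->textContent));
                $subNode->setAttribute("style", "text-align:".$child->getAttribute("align"));
                $newNode->appendChild($subNode);
            }
            else    {
                // Clone rather than move, so $node->childNodes is not modified mid-loop
                $newNode->appendChild($child->cloneNode(true));
            }
        }

        $node->parentNode->removeChild($node);
    }
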

    For the sample you give, this outputs:

    <p style="text-align:right">
        Previous claim $ &nbsp;&nbsp;935.00
        <p style="text-align:right">&nbsp;&nbsp;This claim $1,572.50</p>
    </p>
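
A possible alternative, offered as a sketch rather than a definitive fix: move the children instead of cloning them, after first copying the child list into an array with iterator_to_array. Because a nested <div align="..."> is moved intact and remains in the document, the outer XPath loop converts it on a later pass, so nesting of any depth and markup inside the nested <div> should both survive. The function name convertAlignedDivs is hypothetical, and the libxml_use_internal_errors call is an added assumption to keep warnings from messy legacy markup quiet.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the question or answer.
function convertAlignedDivs(string $html): string
{
    $dom = new DOMDocument();
    // Assumption: collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//div[@align]') as $div) {
        // Each <div> contributes its own align value to the new <p>.
        $p = $dom->createElement('p');
        $p->setAttribute('style', 'text-align:' . $div->getAttribute('align'));

        // Snapshot the children, then move them; a nested <div> is moved
        // as-is and converted when the outer loop reaches it.
        foreach (iterator_to_array($div->childNodes) as $child) {
            $p->appendChild($child);
        }

        // Same insert-then-remove sequence as the code in the question.
        $div->parentNode->insertBefore($p, $div);
        $div->parentNode->removeChild($div);
    }

    return $dom->saveHTML();
}
```

Called with the sample string from the question, this should return nested <p> tags like the output above, which CKEditor is said to tidy up.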
    