duanpin2009 2019-05-05 10:13

Using DOMXPath to clean up deprecated HTML code (converting nested <div> tags to <p> tags)

I'm trying to read rich text stored in an old MS Access database into a new PHP web app. The sanitised data will be displayed to users in CKEditor, which is quite strict about parsing standards-compliant HTML. However, the data stored in MS Access is often badly formed or uses deprecated HTML.

Below is an example piece of data I am trying to sanitise:

<div align="right">Previous claim $ &nbsp;&nbsp;935.00<div align="right">&nbsp;&nbsp;This claim $1,572.50</div></div>

This data is meant to be two lines of right-justified text. However, MS Access has styled the <div> tags with the deprecated align attribute instead of a style attribute, and has nested them when in this scenario they should be sequential.

To turn this example data into two lines of text that are both right-justified and that CKEditor will read and display as intended (i.e. text appears as right justified), I am trying to replace the <div> tags with <p> tags, and inject an inline style attribute with right text-align to replace the deprecated align attribute.

I am using PHP's DOMXPath to clean up the data, with the following code:

$dom = new DOMDocument();
$dom->loadHTML($dataForCleaning, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@align]') as $node) {
    $alignment = $node->getAttribute('align');

    $newNode = $dom->createElement('p');
    $newNode->setAttribute("style", "text-align:".$alignment);
    $node->parentNode->insertBefore($newNode, $node);

    foreach ($node->childNodes as $child) {
        $newNode->appendChild($child);
    }

    $node->parentNode->removeChild($node);
}

I am using insertBefore in lieu of appendChild in trying to keep the sequence of elements the same, but this is what's causing the issues in this nested data example.

For non-nested <div> tags, the sanitised output HTML is correct. However, in this nested <div> example, the output ends up being:

<p style="text-align:right">Previous claim $ &nbsp;&nbsp;935.00</p>

Note that the second line of text (This claim...) has been removed, because it was inside a nested <div> that is a child of the parent <div>.

I don't mind if the resultant <p> tags remain nested, as CKEditor ends up cleaning these up, but I do need to make sure I'm not losing data like this current code does.

Thanks in advance for any help and guidance. -Mark


1 answer

  • duanli8391 2019-05-05 11:12

    I've changed a couple of things. First, rather than appending the existing node, I clone it and append the copy (in $newNode->appendChild($child->cloneNode(true));). Second, because you are moving the enclosed node, I think the XPath result no longer points to the moved node. So instead, while copying the child nodes, I check whether a child matches the same <div align="right"> pattern and, if so, I create a new node in the new format and add that instead:

    foreach ($xpath->query('//div[@align]') as $node) {
        $alignment = $node->getAttribute('align');
    
        $newNode = $dom->createElement('p');
        $newNode->setAttribute("style", "text-align:".$alignment);
    
        $node->parentNode->insertBefore($newNode, $node);
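        foreach ($node->childNodes as $child) {
            // getAttribute() returns "" when the attribute is missing, so a
            // nested <div> without align falls through to the clone branch.
            if ( $child instanceof DOMElement && $child->localName == "div"
                    && $child->getAttribute("align") == "right" )    {
                // Rebuild the nested <div align="right"> as a <p> (text content only)
                $subNode = $dom->createElement('p', $child->nodeValue );
                $subNode->setAttribute("style", "text-align:".$alignment);
                $newNode->appendChild($subNode);
            }
            else    {
                // Append a deep copy so the list being iterated is left untouched
                $newNode->appendChild($child->cloneNode(true));
            }
        }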
    
        $node->parentNode->removeChild($node);
    }
    

    which for the sample you give will output...

    <p style="text-align:right">
        Previous claim $ &nbsp;&nbsp;935.00
        <p style="text-align:right">&nbsp;&nbsp;This claim $1,572.50</p>
    </p>
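
As an alternative, here is a minimal sketch that moves the children instead of cloning them. It assumes the data loss in the original loop comes from moving nodes out of $node->childNodes while iterating over it: that list is live, so the loop ends after the first child has been moved. The function name convertAlignedDivs is hypothetical and not part of the original post.

```php
<?php
// Hypothetical helper (name not from the original post): converts every
// <div align="..."> into <p style="text-align:..."> without dropping children.
function convertAlignedDivs(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//div[@align]') as $div) {
        $p = $dom->createElement('p');
        $p->setAttribute('style', 'text-align:' . $div->getAttribute('align'));
        $div->parentNode->insertBefore($p, $div);

        // Always take the current first child, so nothing is skipped while
        // the child list shrinks. A nested <div align> is moved along with
        // the rest and converted when the outer loop reaches it.
        while ($div->firstChild !== null) {
            $p->appendChild($div->firstChild);
        }

        $div->parentNode->removeChild($div);
    }

    return trim($dom->saveHTML());
}

$sample = '<div align="right">Previous claim $ &nbsp;&nbsp;935.00'
        . '<div align="right">&nbsp;&nbsp;This claim $1,572.50</div></div>';
echo convertAlignedDivs($sample), "\n";
```

With the sample from the question this should keep both lines of text, still as nested <p> tags, which the question says CKEditor cleans up on its own. The alignment of each nested <div> is taken from its own align attribute rather than from the parent.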
    
    This answer was accepted by the asker as the best answer.