dongshenjie3055 2016-09-13 10:41

Accepted

How to remove unwanted HTML tags from user input but keep the text inside the tags in PHP using DOMDocument

I have about 2 million HTML pages stored in S3 that contain various HTML. I'm trying to extract only the content from those stored pages, but I want to retain the HTML structure under certain constraints. This HTML is all user-supplied input and should be considered unsafe. So for display purposes, I want to retain only some of the HTML tags, with constraints on attributes and attribute values, while still keeping all of the properly encoded text content, even the text inside disallowed tags.

For example, I'd like to allow only specific tags like <p>, <h1>, <h2>, <h3>, <ul>, <ol>, <li>, etc. But I also want to keep whatever text is found inside disallowed tags and maintain its structure. I also want to be able to restrict the attributes on each tag, or force certain attributes to be applied to specific tags.

For example, in the following HTML...

<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>

I'd like the result to be...

  Some text...
  <p>Hello PHP!</p>

Thus stripping out the unwanted <div> and <span> tags, the unwanted attributes of all tags, and still maintaining the text inside <div> and <span>.

Simply using strip_tags() won't work here, so I tried the following with DOMDocument.

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach($dom->childNodes as $node) {
    if ($node->nodeName != "p") { // only allow paragraph tags
        $text = $node->nodeValue;
        $node->parentNode->nodeValue .= $text;
        $node->parentNode->removeChild($node);
    }
}

echo $dom->saveHTML();

This works in simple cases where there are no nested tags, but obviously fails when the HTML is complex.

I can't simply call this function recursively on each of the node's child nodes, because if I delete the node I lose all of its nested children. Even if I defer node deletion until after the recursion, the order of text insertion becomes tricky: I go deep and return all valid nodes, then start concatenating the values of the invalid child nodes together, and the result is really messy.

For example, let's say I want to allow <p> and <em> in the following HTML

<p>Hello <strong>there <em>PHP</em>!</strong></p>

But I don't want to allow <strong>. If the <strong> has a nested <em>, my approach gets really confusing, because I'd get something like...

<p>Hello there !<em>PHP</em></p>

That is obviously wrong. I realized that taking the entire nodeValue is a bad way of doing this, so I started digging into other ways to go through the tree one node at a time. I'm just finding it very difficult to generalize this solution so that it works sanely every time.

Update

A solution based on strip_tags() or on the answer provided here isn't helpful for my use case, because the former does not let me control the attributes and the latter removes any tag that has attributes. I don't want to remove every tag that has an attribute. I want to explicitly allow certain tags but still have extensible control over which attributes can be kept or modified in the HTML.
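
To make the strip_tags() limitation concrete, here is a minimal sketch. The input string and its attribute values are made up for illustration; only the built-in strip_tags() function is used.

```php
<?php
// Illustrative input only; the attribute values are invented for this sketch.
$html = '<p class="x" onclick="evil()">Hello <span style="color: purple;">PHP</span>!</p>';

// Allow <p> only: the <span> tag is dropped and its text is kept...
$result = strip_tags($html, '<p>');

// ...but every attribute on the surviving <p> is left untouched.
echo $result, PHP_EOL;
```

The <span> is removed as desired, but class and onclick stay on the <p>, which is why an attribute-aware approach is needed.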

2 answers

doufubian3479 2016-09-13 11:45

It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.

First, Walking the DOM Tree

To get to a working solution, I found I needed a sensible way to traverse every node in the DOM tree and inspect it to determine whether it should be kept as-is or modified.

So I wrote the following method as a simple generator in a class extending DOMDocument.

class HTMLFixer extends DOMDocument {
    public function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}

This way, doing something like foreach ($dom->walk($dom) as $node) gives me a simple loop that traverses the entire tree. Of course, this requires PHP 7 or later because of the yield from syntax, but I'm OK with that.
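
As a rough illustration of the traversal order, here is a standalone sketch of the same generator. The function name walkDom() and the sample markup are assumptions made for this example; they are not part of the class above.

```php
<?php
// walkDom() is a hypothetical standalone version of the walk() method.
function walkDom(DOMNode $node): Generator
{
    // Yield the node itself first, so parents come before their children.
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            yield from walkDom($child);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hello <em>PHP</em>!</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Collect node names in the order the generator visits them.
$names = [];
foreach (walkDom($dom) as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), PHP_EOL;
```

The document node comes first, followed by each element and the text nodes inside it, in document order.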

Second, Removing Tags but Keeping their Text

The tricky part was figuring out how to keep the text but not the tag while making modifications inside the loop. After struggling with a few different approaches, I found the simplest way was to build a list of nodes to be removed from inside the loop and process it afterwards: first use DOMNode::insertBefore() to move their child nodes up the tree, then remove the emptied nodes. That way, removing those nodes later has no side effects.

So I added another generalized stripTags method to this child class for DOMDocument.

public function stripTags(DOMNode $node) {
    $change = $remove = [];

    /* Walk the entire tree to build a list of things that need to be removed */
    foreach($this->walk($node) as $n) {
        if ($n instanceof DOMText || $n instanceof DOMDocument) {
            continue;
        }
        $this->stripAttributes($n); // strips all node attributes not allowed
        $this->forceAttributes($n); // forces any required attributes
        if (!in_array($n->nodeName, $this->allowedTags, true)) {
            // track the disallowed node for removal
            $remove[] = $n;
            // we take all of its child nodes for modification later
            foreach($n->childNodes as $child) {
                $change[] = [$child, $n];
            }
        }
    }

    /* Go through the list of changes first so we don't break the
       referential integrity of the tree */
    foreach($change as list($a, $b)) {
        $b->parentNode->insertBefore($a, $b);
    }

    /* Now we can safely remove the old nodes */
    foreach($remove as $a) {
        if ($a->parentNode) {
            $a->parentNode->removeChild($a);
        }
    }
}

The trick here is that insertBefore() moves a node rather than copying it. If we moved the child nodes (e.g. text nodes) of the disallowed tags up to the parent while still walking the tree, we could easily break the traversal. This confused me a lot at first, but it makes sense given how the method works. Deferring the moves until the walk has finished ensures we don't break a parentNode reference when, for example, a deeper node is allowed but its parent is not in the allowed tags list.
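
The move-then-remove idea can be seen in isolation in the sketch below, which unwraps a single <strong> element. The sample markup comes from the question; snapshotting the children with iterator_to_array() is a choice made for this example, not something the class above does.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$strong = $dom->getElementsByTagName('strong')->item(0);

// Snapshot the children first: childNodes is a live list, so moving nodes
// while iterating it directly could skip entries.
$children = iterator_to_array($strong->childNodes);

foreach ($children as $child) {
    // insertBefore() moves an attached node to its new position; it does not copy it.
    $strong->parentNode->insertBefore($child, $strong);
}

// <strong> is now empty, so removing it loses nothing.
$remaining = $strong->childNodes->length;
$strong->parentNode->removeChild($strong);

$output = $dom->saveHTML();
echo $output;
```

The text and the nested <em> end up in their original order directly inside the <p>, which is the result the question was after.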

Complete Solution

Here's the complete solution I came up with to solve this problem more generally. I'll include it in my answer, since I struggled to find coverage of many of the edge cases of doing this with DOMDocument elsewhere. It lets you specify which tags to allow, and all other tags are removed. It also lets you specify which attributes are allowed, removing all others, and can even force certain attributes on certain tags.

class HTMLFixer extends DOMDocument {
    protected static $defaultAllowedTags = [
        'p',
        'h1',
        'h2',
        'h3',
        'h4',
        'h5',
        'h6',
        'pre',
        'code',
        'blockquote',
        'q',
        'strong',
        'em',
        'del',
        'img',
        'a',
        'table',
        'thead',
        'tbody',
        'tfoot',
        'tr',
        'th',
        'td',
        'ul',
        'ol',
        'li',
    ];
    protected static $defaultAllowedAttributes = [
        'a'   => ['href'],
        'img' => ['src'],
        'pre' => ['class'],
    ];
    protected static $defaultForceAttributes = [
        'a' => ['target' => '_blank'],
    ];

    protected $allowedTags       = [];
    protected $allowedAttributes = [];
    protected $forceAttributes   = [];

    public function __construct($version = null, $encoding = null, $allowedTags = [],
                                $allowedAttributes = [], $forceAttributes = []) {
        $this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
        $this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
        $this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
        parent::__construct($version, $encoding);
    }

    public function setAllowedTags(Array $tags) {
        $this->allowedTags = $tags;
    }

    public function setAllowedAttributes(Array $attributes) {
        $this->allowedAttributes = $attributes;
    }

    public function setForceAttributes(Array $attributes) {
        $this->forceAttributes = $attributes;
    }

    public function getAllowedTags() {
        return $this->allowedTags;
    }

    public function getAllowedAttributes() {
        return $this->allowedAttributes;
    }

    public function getForceAttributes() {
        return $this->forceAttributes;
    }

    public function saveHTML(DOMNode $node = null) {
        if (!$node) {
            $node = $this;
        }
        $this->stripTags($node);
        return parent::saveHTML($node);
    }

    protected function stripTags(DOMNode $node) {
        $change = $remove = [];
        foreach($this->walk($node) as $n) {
            if ($n instanceof DOMText || $n instanceof DOMDocument) {
                continue;
            }
            $this->stripAttributes($n);
            $this->forceAttributes($n);
            if (!in_array($n->nodeName, $this->allowedTags, true)) {
                $remove[] = $n;
                foreach($n->childNodes as $child) {
                    $change[] = [$child, $n];
                }
            }
        }
        foreach($change as list($a, $b)) {
            $b->parentNode->insertBefore($a, $b);
        }
        foreach($remove as $a) {
            if ($a->parentNode) {
                $a->parentNode->removeChild($a);
            }
        }
    }

    protected function stripAttributes(DOMNode $node) {
        $attributes = $node->attributes;
        $len = $attributes->length;
        for ($i = $len - 1; $i >= 0; $i--) {
            $attr = $attributes->item($i);
            if (!isset($this->allowedAttributes[$node->nodeName]) ||
                !in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
                $node->removeAttributeNode($attr);
            }
        }
    }

    protected function forceAttributes(DOMNode $node) {
        if (isset($this->forceAttributes[$node->nodeName])) {
            foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
                $node->setAttribute($attribute, $value);
            }
        }
    }

    protected function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}

So if we have the following HTML

<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>

And we only want to allow <p> and <em>.

$html = <<<'HTML'
    <div id="content">
      Some text...
      <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
    </div>
HTML;

$dom = new HTMLFixer(null, null, ['p', 'em']);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML($dom);

We'd get something like this...

      Some text...
      <p>Hello P<em>H</em>P!</p>

Since you can also limit this to a specific subtree of the DOM, the solution could be generalized even further.
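
One thing to keep in mind: stripAttributes() above checks attribute names only, so an allowed href is kept whatever its value is, including a javascript: URL. Since the question also asks for constraints on attribute values, a value check might be sketched as follows. The helper name hasSafeScheme(), the list of allowed schemes, and the sample markup are all assumptions made for this example, and it is not meant as a complete URL sanitizer.

```php
<?php
// Hypothetical helper: true when $url is relative or uses an allowed scheme.
function hasSafeScheme(string $url, array $schemes = ['http', 'https', 'mailto']): bool
{
    // Drop whitespace and control characters so "java\tscript:" is seen as "javascript:".
    $clean = preg_replace('/[\x00-\x20]+/', '', $url);
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $clean, $m)) {
        return true; // no scheme found, so treat it as a relative URL
    }
    return in_array(strtolower($m[1]), $schemes, true);
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<p><a href="javascript:alert(1)">bad</a> <a href="https://example.com/">good</a></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

foreach ($dom->getElementsByTagName('a') as $a) {
    // Remove the attribute (not the element) when its value fails the check.
    if (!hasSafeScheme($a->getAttribute('href'))) {
        $a->removeAttribute('href');
    }
}

$output = $dom->saveHTML();
echo $output;
```

A check like this could be called from stripAttributes() for attributes that carry URLs, such as href and src.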

This answer was selected by the asker as the best answer.
