dongshenjie3055 2016-09-13 10:41
Accepted

How to remove unwanted HTML tags from user input but keep the text inside those tags, using DOMDocument in PHP

I have roughly 2 million stored HTML pages in S3 that contain various HTML. I'm trying to extract only the content from those stored pages, but I wish to retain the HTML structure with certain constraints. This HTML is all user-supplied input and should be considered unsafe. So for display purposes, I want to retain only some of the HTML tags, with constraints on attributes and attribute values, but still retain all of the properly encoded text content, even inside disallowed tags.

For example, I'd like to allow only specific tags like <p>, <h1>, <h2>, <h3>, <ul>, <ol>, <li>, etc. But I also want to keep whatever text is found between disallowed tags and maintain its structure. I also want to be able to restrict attributes in each tag or force certain attributes to be applied to specific tags.

For example, in the following HTML...

<div id="content">
  Some text...
  <p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>

I'd like the result to be...

  Some text...
  <p>Hello PHP!</p>

Thus stripping out the unwanted <div> and <span> tags, the unwanted attributes of all tags, and still maintaining the text inside <div> and <span>.

Simply using strip_tags() won't work here. So I tried doing the following with DOMDocument.

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach($dom->childNodes as $node) {
    if ($node->nodeName != "p") { // only allow paragraph tags
        $text = $node->nodeValue;
        $node->parentNode->nodeValue .= $text;
        $node->parentNode->removeChild($node);
    }
}

echo $dom->saveHTML();

Which would work on simple cases where there aren't nested tags, but obviously fails when the HTML is complex.

I can't simply call this function recursively on each node's children, because deleting a node loses all of its nested children. Even if I defer the deletion until after the recursion, the order of text insertion becomes tricky: I go deep, return all the valid nodes, and then concatenate the values of the invalid child nodes, and the result is really messy.

For example, let's say I want to allow <p> and <em> in the following HTML

<p>Hello <strong>there <em>PHP</em>!</strong></p>

But I don't want to allow <strong>. If the <strong> has nested <em> my approach gets really confusing. Because I'd get something like ...

<p>Hello there !<em>PHP</em></p>

Which is obviously wrong. I realized that taking the entire nodeValue is a bad way of doing this, so I started looking into other ways to go through the tree one node at a time. I'm just finding it very difficult to generalize this solution so that it works sanely every time.
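
To illustrate why nodeValue loses the structure, here is a minimal sketch using the example above (the variable names are only for illustration). Reading nodeValue on an element returns the concatenated text of all its descendants, so the boundary of the nested <em> is no longer recoverable from that string alone.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$strong = $dom->getElementsByTagName('strong')->item(0);

// nodeValue flattens the whole subtree into plain text.
$flattened = $strong->nodeValue;
echo $flattened, PHP_EOL;

// The structure is still there, but only through the child nodes.
$childNames = [];
foreach ($strong->childNodes as $child) {
    $childNames[] = $child->nodeName;
}
echo implode(', ', $childNames), PHP_EOL;
```

Any approach that keeps the <em> therefore has to work on the child nodes themselves rather than on the flattened text.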

Update

A solution to use strip_tags() or the answer provided here isn't helpful to my use case, because the former does not allow me to control the attributes and the latter removes any tag that has attributes. I don't want to remove any tag that has an attribute. I want to explicitly allow certain tags but still have extensible control over what attributes can be kept/modified in the HTML.
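
As a small sketch of the strip_tags() limitation described above: its second argument lists the tags to keep, but the attributes of those tags are left untouched, so there is no way to filter them.

```php
<?php
$html = '<div id="content">Some text... '
      . '<p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>'
      . '</div>';

// Keep only <p>; every other tag is dropped but its text is kept.
$result = strip_tags($html, '<p>');

// The <div> and <span> are gone, yet class="someclass" survives on the <p>.
echo $result, PHP_EOL;
```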


2 answers

  • doufubian3479 2016-09-13 11:45

    It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.

    First, Walking the DOM Tree

    In order to get to a working solution I found I need to have a sensible way to traverse every node in the DOM tree and inspect it in order to determine if it should be kept as-is or modified.

    So I wrote the following method as a simple generator in a class extending DOMDocument.

    class HTMLFixer extends DOMDocument {
        public function walk(DOMNode $node, $skipParent = false) {
            if (!$skipParent) {
                yield $node;
            }
            if ($node->hasChildNodes()) {
                foreach ($node->childNodes as $n) {
                    yield from $this->walk($n);
                }
            }
        }
    }
    

    This way doing something like foreach($dom->walk($dom) as $node) gives me a simple loop to traverse the entire tree. Of course this is a PHP 7 only solution because of the yield from syntax, but I'm OK with that.
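
    To make the traversal order concrete, here is a sketch with a standalone copy of the generator (a free function instead of a method, purely for illustration). It visits the nodes in document order, parent before children.

```php
<?php
// Standalone copy of walk() for illustration; same logic as the method above.
function walk(DOMNode $node)
{
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            yield from walk($child);
        }
    }
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Collect node names in the order the generator yields them.
$names = [];
foreach (walk($dom) as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), PHP_EOL;
```

    One thing to keep in mind: yield from does not renumber keys, so the keys of the inner generators repeat. If you collect the nodes with iterator_to_array(), pass false as the second argument so that later nodes do not overwrite earlier ones.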

    Second, Removing Tags but Keeping their Text

    The tricky part was figuring out how to keep the text but not the tag while making modifications inside the loop. After struggling with a few different approaches, I found the simplest way was to build a list of tags to remove from inside the loop, and only afterwards use DOMNode::insertBefore() to move their child nodes up the tree and remove the tags themselves. That way the removal has no side effects on the traversal.

    So I added another generalized stripTags method to this child class for DOMDocument.

    public function stripTags(DOMNode $node) {
        $change = $remove = [];
    
        /* Walk the entire tree to build a list of things that need to be removed */
        foreach($this->walk($node) as $n) {
            if ($n instanceof DOMText || $n instanceof DOMDocument) {
                continue;
            }
            $this->stripAttributes($n); // strips all node attributes not allowed
            $this->forceAttributes($n); // forces any required attributes
            if (!in_array($n->nodeName, $this->allowedTags, true)) {
                // track the disallowed node for removal
                $remove[] = $n;
                // we take all of its child nodes for modification later
                foreach($n->childNodes as $child) {
                    $change[] = [$child, $n];
                }
            }
        }
    
        /* Go through the list of changes first so we don't break the
           referential integrity of the tree */
        foreach($change as list($a, $b)) {
            $b->parentNode->insertBefore($a, $b);
        }
    
        /* Now we can safely remove the old nodes */
        foreach($remove as $a) {
            if ($a->parentNode) {
                $a->parentNode->removeChild($a);
            }
        }
    }
    

    The trick is that insertBefore() moves a node that is already in the tree rather than copying it. Moving the child nodes (e.g. text nodes) of a disallowed tag up to its parent while still walking the tree would change the very child lists the generator is iterating over, and could easily break the traversal. This confused me a lot at first, but it makes sense once you look at how the method works. Deferring the moves until the walk has finished ensures we don't break any parentNode reference, for example when a deeper node is allowed but its parent is not in the allowed tags list.
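
    The move semantics can be seen in isolation with a small sketch. The unwrap() helper below is hypothetical (it is not part of the class above) and handles a single element outside of any traversal: each insertBefore() call detaches the child from the disallowed element and re-attaches it in front of that element, which is why the loop can keep reading firstChild until nothing is left.

```php
<?php
// Hypothetical helper: replace an attached element with its own children.
function unwrap(DOMElement $element)
{
    $parent = $element->parentNode;
    // insertBefore() moves an attached node, so firstChild changes on every pass.
    while ($element->firstChild !== null) {
        $parent->insertBefore($element->firstChild, $element);
    }
    // The element is now empty and can be dropped safely.
    $parent->removeChild($element);
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$strong = $dom->getElementsByTagName('strong')->item(0);
$em     = $dom->getElementsByTagName('em')->item(0);

unwrap($strong);

// The <em> keeps its position in the text but now hangs directly under <p>.
$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

    The same $em object is still valid after the move, which is what makes it possible to record nodes during the walk and relocate them afterwards.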

    Complete Solution

    Here's the complete solution I came up with to solve this problem more generally. I'll include it in my answer, since I struggled to find coverage elsewhere of many of the edge cases of doing this with DOMDocument. It lets you specify which tags to allow, and all other tags are removed. It also lets you specify which attributes are allowed, so that all other attributes are removed, and even force certain attributes onto certain tags.

    class HTMLFixer extends DOMDocument {
        protected static $defaultAllowedTags = [
            'p',
            'h1',
            'h2',
            'h3',
            'h4',
            'h5',
            'h6',
            'pre',
            'code',
            'blockquote',
            'q',
            'strong',
            'em',
            'del',
            'img',
            'a',
            'table',
            'thead',
            'tbody',
            'tfoot',
            'tr',
            'th',
            'td',
            'ul',
            'ol',
            'li',
        ];
        protected static $defaultAllowedAttributes = [
            'a'   => ['href'],
            'img' => ['src'],
            'pre' => ['class'],
        ];
        protected static $defaultForceAttributes = [
            'a' => ['target' => '_blank'],
        ];
    
        protected $allowedTags       = [];
        protected $allowedAttributes = [];
        protected $forceAttributes   = [];
    
        public function __construct($version = null, $encoding = null, $allowedTags = [],
                                    $allowedAttributes = [], $forceAttributes = []) {
            $this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
            $this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
            $this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
            parent::__construct($version, $encoding);
        }
    
        public function setAllowedTags(Array $tags) {
            $this->allowedTags = $tags;
        }
    
        public function setAllowedAttributes(Array $attributes) {
            $this->allowedAttributes = $attributes;
        }
    
        public function setForceAttributes(Array $attributes) {
            $this->forceAttributes = $attributes;
        }
    
        public function getAllowedTags() {
            return $this->allowedTags;
        }
    
        public function getAllowedAttributes() {
            return $this->allowedAttributes;
        }
    
        public function getForceAttributes() {
            return $this->forceAttributes;
        }
    
        public function saveHTML(DOMNode $node = null) {
            if (!$node) {
                $node = $this;
            }
            $this->stripTags($node);
            return parent::saveHTML($node);
        }
    
        protected function stripTags(DOMNode $node) {
            $change = $remove = [];
            foreach($this->walk($node) as $n) {
                if ($n instanceof DOMText || $n instanceof DOMDocument) {
                    continue;
                }
                $this->stripAttributes($n);
                $this->forceAttributes($n);
                if (!in_array($n->nodeName, $this->allowedTags, true)) {
                    $remove[] = $n;
                    foreach($n->childNodes as $child) {
                        $change[] = [$child, $n];
                    }
                }
            }
            foreach($change as list($a, $b)) {
                $b->parentNode->insertBefore($a, $b);
            }
            foreach($remove as $a) {
                if ($a->parentNode) {
                    $a->parentNode->removeChild($a);
                }
            }
        }
    
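        protected function stripAttributes(DOMNode $node) {
            // Comments and other non-element nodes have no attributes to strip.
            if (!$node->hasAttributes()) {
                return;
            }
            $attributes = $node->attributes;
            $len = $attributes->length;
            // Iterate backwards because removing an attribute shifts the list.
            for ($i = $len - 1; $i >= 0; $i--) {
                $attr = $attributes->item($i);
                if (!isset($this->allowedAttributes[$node->nodeName]) ||
                    !in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
                    $node->removeAttributeNode($attr);
                }
            }
        }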
    
        protected function forceAttributes(DOMNode $node) {
            if (isset($this->forceAttributes[$node->nodeName])) {
                foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
                    $node->setAttribute($attribute, $value);
                }
            }
        }
    
        protected function walk(DOMNode $node, $skipParent = false) {
            if (!$skipParent) {
                yield $node;
            }
            if ($node->hasChildNodes()) {
                foreach ($node->childNodes as $n) {
                    yield from $this->walk($n);
                }
            }
        }
    }
    

    So if we have the following HTML

    <div id="content">
      Some text...
      <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
    </div>
    

    And we only want to allow <p> and <em>.

    $html = <<<'HTML'
        <div id="content">
          Some text...
          <p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
        </div>
    HTML;
    
    $dom = new HTMLFixer(null, null, ['p', 'em']);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    
    echo $dom->saveHTML($dom);
    

    We'd get something like this...

          Some text...
          <p>Hello P<em>H</em>P!</p>
    

    Since you can also limit this to a specific subtree in the DOM, the solution could be generalized even more.
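
    One thing the allow-lists do not cover is attribute values: an allowed href or src can still carry a javascript: URL. Below is a minimal sketch of one way to check the scheme. The helper name hasAllowedScheme() and the default scheme list are assumptions made for illustration, and this is not a complete URL sanitizer.

```php
<?php
// Hypothetical helper: accept relative URLs and URLs with an allowed scheme.
function hasAllowedScheme($url, array $schemes = ['http', 'https', 'mailto'])
{
    // Browsers ignore ASCII whitespace and control characters inside a scheme.
    $clean = preg_replace('/[\x00-\x20]+/', '', $url);
    $colon = strpos($clean, ':');
    if ($colon === false) {
        return true; // no scheme at all: relative URL
    }
    $prefix = substr($clean, 0, $colon);
    if (strpbrk($prefix, '/?#') !== false) {
        return true; // the colon belongs to the path, query or fragment
    }
    return in_array(strtolower($prefix), $schemes, true);
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<p><a href="javascript:alert(1)">bad</a> <a href="https://example.com/">good</a></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Drop href values whose scheme is not on the list; the link text stays.
foreach ($dom->getElementsByTagName('a') as $a) {
    if (!hasAllowedScheme($a->getAttribute('href'))) {
        $a->removeAttribute('href');
    }
}

$links = $dom->getElementsByTagName('a');
echo $dom->saveHTML();
```

    A check like this could be called from the attribute-stripping step for whichever attributes hold URLs. For real user input, a maintained sanitizer library is probably a safer choice than a hand-written check.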

    This answer was selected as the best answer by the asker.