It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.
First, Walking the DOM Tree
In order to get to a working solution I found I need to have a sensible way to traverse every node in the DOM tree and inspect it in order to determine if it should be kept as-is or modified.
So I used wrote the following method as a simple generator extending from DOMDocument
.
class HTMLFixer extends DOMDocument {
public function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from $this->walk($n);
}
}
}
}
This way doing something like foreach($dom->walk($dom) as $node)
gives me a simple loop to traverse the entire tree. Of course this is a PHP 7 only solution because of the yield from
syntax, but I'm OK with that.
Second, Removing Tags but Keeping their Text
The tricky part was figuring out how to keep the text and not the tag while making modifications inside the loop. So after struggling with a few different approaches I found the simplest way was to build a list of tags to be removed from inside the loop and then remove them later using DOMNode::insertBefore()
to append the text nodes up the tree. That way removing those nodes later has no side effects.
So I added another generalized stripTags
method to this child class for DOMDocument
.
public function stripTags(DOMNode $node) {
$change = $remove = [];
/* Walk the entire tree to build a list of things that need removed */
foreach($this->walk($node) as $n) {
if ($n instanceof DOMText || $n instanceof DOMDocument) {
continue;
}
$this->stripAttributes($n); // strips all node attributes not allowed
$this->forceAttributes($n); // forces any required attributes
if (!in_array($n->nodeName, $this->allowedTags, true)) {
// track the disallowed node for removal
$remove[] = $n;
// we take all of its child nodes for modification later
foreach($n->childNodes as $child) {
$change[] = [$child, $n];
}
}
}
/* Go through the list of changes first so we don't break the
referential integrity of the tree */
foreach($change as list($a, $b)) {
$b->parentNode->insertBefore($a, $b);
}
/* Now we can safely remove the old nodes */
foreach($remove as $a) {
if ($a->parentNode) {
$a->parentNode->removeChild($a);
}
}
}
The trick here is because we use insertBefore
, on the child nodes (i.e. text node) of the disallowed tags, to move them up to the parent tag, we could easily break the tree (we're copying). This confused me a lot at first, but looking at the way the method works, it makes sense. Deferring the move of the node makes sure we don't break parentNode
reference when the deeper node is the one that's allowed, but its parent is not in the allowed tags list for example.
Complete Solution
Here's the complete solution I came up with to more generally solve this problem. I'll include in my answer since I struggled to find a lot of the edge cases in doing this with DOMDocument elsewhere. It allows you to specify which tags to allow, and all other tags are removed. It also allows you to specify which attributes are allowed and all other attributes can be removed (even forcing certain attributes on certain tags).
class HTMLFixer extends DOMDocument {
protected static $defaultAllowedTags = [
'p',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'pre',
'code',
'blockquote',
'q',
'strong',
'em',
'del',
'img',
'a',
'table',
'thead',
'tbody',
'tfoot',
'tr',
'th',
'td',
'ul',
'ol',
'li',
];
protected static $defaultAllowedAttributes = [
'a' => ['href'],
'img' => ['src'],
'pre' => ['class'],
];
protected static $defaultForceAttributes = [
'a' => ['target' => '_blank'],
];
protected $allowedTags = [];
protected $allowedAttributes = [];
protected $forceAttributes = [];
public function __construct($version = null, $encoding = null, $allowedTags = [],
$allowedAttributes = [], $forceAttributes = []) {
$this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
$this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
$this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
parent::__construct($version, $encoding);
}
public function setAllowedTags(Array $tags) {
$this->allowedTags = $tags;
}
public function setAllowedAttributes(Array $attributes) {
$this->allowedAttributes = $attributes;
}
public function setForceAttributes(Array $attributes) {
$this->forceAttributes = $attributes;
}
public function getAllowedTags() {
return $this->allowedTags;
}
public function getAllowedAttributes() {
return $this->allowedAttributes;
}
public function getForceAttributes() {
return $this->forceAttributes;
}
public function saveHTML(DOMNode $node = null) {
if (!$node) {
$node = $this;
}
$this->stripTags($node);
return parent::saveHTML($node);
}
protected function stripTags(DOMNode $node) {
$change = $remove = [];
foreach($this->walk($node) as $n) {
if ($n instanceof DOMText || $n instanceof DOMDocument) {
continue;
}
$this->stripAttributes($n);
$this->forceAttributes($n);
if (!in_array($n->nodeName, $this->allowedTags, true)) {
$remove[] = $n;
foreach($n->childNodes as $child) {
$change[] = [$child, $n];
}
}
}
foreach($change as list($a, $b)) {
$b->parentNode->insertBefore($a, $b);
}
foreach($remove as $a) {
if ($a->parentNode) {
$a->parentNode->removeChild($a);
}
}
}
protected function stripAttributes(DOMNode $node) {
$attributes = $node->attributes;
$len = $attributes->length;
for ($i = $len - 1; $i >= 0; $i--) {
$attr = $attributes->item($i);
if (!isset($this->allowedAttributes[$node->nodeName]) ||
!in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
$node->removeAttributeNode($attr);
}
}
}
protected function forceAttributes(DOMNode $node) {
if (isset($this->forceAttributes[$node->nodeName])) {
foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
$node->setAttribute($attribute, $value);
}
}
}
protected function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from $this->walk($n);
}
}
}
}
So if we have the following HTML
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>
And we only want to allow <p>
, and <em>
.
$html = <<<'HTML'
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>
HTML;
$dom = new HTMLFixer(null, null, ['p', 'em']);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $dom->saveHTML($dom);
We'd get something like this...
Some text...
<p>Hello P<em>H</em>P!</p>
Since you can limit this to a specific subtree in the DOM as well the solution could be generalized even more.