All whitespace outside "terminal elements" (leaf element nodes) lives in text-nodes, as all text does. So if you select all text-nodes that are outside of terminal elements, you can remove all whitespace characters from them. That is already the whole answer.
Let's start lightly by just removing whitespace from one text-node in an XML Document.
As PHP uses UTF-8 as the character encoding for its XML parsers (I use DOMDocument in this example), preg_replace is handy here as it knows both UTF-8 and what whitespace characters are:
/** @var DOMText $text */
$text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
This removes all whitespace from a text-node. Here is a demonstration of that:
$doc = new DOMDocument();
$doc->loadXML('<root> Very Simple Demo </root>');
$text = $doc->documentElement->childNodes->item(0);
/** @var DOMText $text */
$text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
$doc->save('php://output');
Output:
<?xml version="1.0"?>
<root>VerySimpleDemo</root>
As you can see the space characters are removed from the one and only text-node that is part of that document.
With a larger document and your "terminal elements", this is naturally more interesting, but it works pretty much the same. The only difference is that you need to get all text-nodes that are not part of leaf-element-nodes. This is best done with an xpath query:
//*[*]/text()
This reads: all text-nodes that are children of elements that contain other elements. Let's use the following XML (file content.xml) as an example:
<?xml version="1.0"?>
<content>
<parent>
<child id="1">
<title>child 1</title>
<child id="1">
<title>
child 1.1 with whitespace
</title>
</child>
</child>
</parent>
</content>
It contains both leaf-elements and elements that have child-elements. It also shows pretty well the whitespace that is used for element indentation.
After loading it:
$file = __DIR__ . '/content.xml';
$doc = new DOMDocument();
$doc->load($file);
A DOMXPath is necessary to execute the xpath-query:
$xp = new DOMXPath($doc);
$texts = $xp->query('//*[*]/text()');
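To see what that query actually picks up, here is a minimal sketch that counts the selected text-nodes next to the ones inside leaf elements. The inline sample document and the contrast query `//*[not(*)]/text()` are assumptions for illustration only; they are not part of content.xml.

```php
<?php
// Hypothetical inline sample, smaller than content.xml but with the same shape.
$xml = "<content>\n<parent>\n<title> keep me </title>\n</parent>\n</content>";

$doc = new DOMDocument();
$doc->loadXML($xml);

$xp = new DOMXPath($doc);

// Text-nodes whose parent element has at least one element child.
$selected = $xp->query('//*[*]/text()');

// Text-nodes inside leaf elements, queried for contrast only.
$leafTexts = $xp->query('//*[not(*)]/text()');

$selectedCount = $selected->length;
$leafCount = $leafTexts->length;

// Check whether every selected node holds nothing but whitespace.
$selectedAllWhitespace = true;
foreach ($selected as $text) {
    if (trim($text->textContent) !== '') {
        $selectedAllWhitespace = false;
    }
}

printf("selected: %d, leaf: %d\n", $selectedCount, $leafCount);
```

In this sample the selected nodes are only the line breaks between elements, while the text of the title element falls into the contrast query and stays out of reach of the whitespace removal.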
What's left is to iterate over all those text-nodes and apply the whitespace removal as above:
foreach ($texts as $text) {
    /** @var DOMText $text */
    $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
}
The result then is:
<?xml version="1.0"?>
<content><parent><child id="1"><title>child 1</title><child id="1"><title>
child 1.1 with whitespace
</title></child></child></parent></content>
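For reuse, the steps so far can be wrapped into one function. This is only a sketch: the function name stripStructuralWhitespace and the inline sample string are made up for this example, and it returns the root element without the XML declaration to keep the output short.

```php
<?php
/**
 * Hypothetical helper: removes whitespace from every text-node whose parent
 * element has element children, and leaves leaf-element text untouched.
 */
function stripStructuralWhitespace(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xp = new DOMXPath($doc);
    foreach ($xp->query('//*[*]/text()') as $text) {
        /** @var DOMText $text */
        $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
    }

    // Serialize only the root element, which skips the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

// Hypothetical sample input with indentation whitespace.
$input = "<content>\n  <parent>\n    <title> child 1 </title>\n  </parent>\n</content>";
$result = stripStructuralWhitespace($input);

echo $result, "\n";
// prints: <content><parent><title> child 1 </title></parent></content>
```

One thing to keep in mind: the query also matches text in mixed content, so something like `<p>Hello <b>world</b> again</p>` would lose the spaces around the words in the p element. If your documents contain mixed content, the regular expression probably needs to be more selective.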
This should answer the question. But it wouldn't be XML if there weren't a little bit more verbosity or some kind of "but...".
Note that "text()" in xpath represents all kinds of text-nodes, including CDATA sections. If a CDATA section consists of whitespace only, the code above renders an empty CDATA section ("<![CDATA[]]>") into the output. One way to deal with that is to remove the empty nodes from the document:
/** @var DOMText $text */
$text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
if (!$text->length) {
    $text->parentNode->removeChild($text);
}
This removes all emptied text-nodes from the document, which keeps the document tree tidy. Hope this helps.
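Put together with the CDATA handling, a complete version might look like the sketch below. The function name stripAndPrune and the sample document are assumptions for illustration. Copying the node list into an array before removing nodes is a cautious choice on my side, not something the code above strictly requires.

```php
<?php
/**
 * Hypothetical helper: strips whitespace from the selected text-nodes
 * (CDATA sections included) and removes the ones that end up empty.
 * Returns the number of removed nodes.
 */
function stripAndPrune(DOMDocument $doc): int
{
    $xp = new DOMXPath($doc);
    $removed = 0;

    // Copy the result into an array before changing the tree.
    foreach (iterator_to_array($xp->query('//*[*]/text()')) as $text) {
        /** @var DOMText $text */
        $text->nodeValue = preg_replace('~\s+~u', '', $text->textContent);
        if (!$text->length) {
            $text->parentNode->removeChild($text);
            $removed++;
        }
    }

    return $removed;
}

// Hypothetical sample with a whitespace-only CDATA section next to a leaf element.
$doc = new DOMDocument();
$doc->loadXML("<root>\n<![CDATA[  ]]><leaf> text </leaf>\n</root>");

$removed = stripAndPrune($doc);
$result = $doc->saveXML($doc->documentElement);

echo $result, "\n";
```

The serialized result should no longer contain a CDATA section, while the text inside the leaf element keeps its surrounding spaces.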