I'd like to strip all comments and redundant whitespace (including line breaks) from an HTML document using PHP.
I tried regular expressions, but they don't seem well suited to parsing an HTML document. I also tried DOMDocument, but it also strips conditional comments for IE, which is definitely unwanted. In addition, it strips neither line breaks nor JavaScript comments, and it doesn't seem to include the doctype in its output.
The goal is to keep only the bytes needed to parse the HTML document.
My current approaches look like this:
Using regular expressions:
# Works quite well, but would also strip strings that look like comments.
$newHtml = preg_replace('/<!--\s*(?!\[\s*if\s|<\s*!\s*\[\s*endif\s*\]).*?-->/is', '', $oldHtml);
# Works, but would also strip intended whitespace within <pre> elements.
$newHtml = preg_replace('/\s+/', ' ', $oldHtml);
# Has one major side effect: JavaScript comments with double slashes (//)
# will lead to the rest of the script being commented as well.
$newHtml = preg_replace('/\r|\n/', '', $oldHtml);
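One way to limit the side effects noted in the comments above might be to match the whitespace-sensitive blocks first and leave them untouched, collapsing whitespace only outside them. The following is a minimal sketch, not a robust parser: `collapseWhitespace()` is a hypothetical helper name, and the list of protected elements (`pre`, `textarea`, `script`, `style`) is an assumption. It can still be fooled by, for example, a `</script>` string literal inside JavaScript.

```php
<?php
// Hypothetical helper, not part of the original attempts.
// Collapses runs of whitespace to a single space, except inside
// <pre>, <textarea>, <script> and <style> blocks (an assumed list).
function collapseWhitespace(string $html): string
{
    // Alternative 1 (group 1): a whole protected block, matched lazily up to
    // its own closing tag via the backreference \2.
    // Alternative 2: any run of whitespace outside such a block.
    $pattern = '#(<(pre|textarea|script|style)\b[^>]*>.*?</\2\s*>)|\s+#is';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 1 is present only when a protected block matched:
            // return it verbatim. Otherwise a whitespace run matched.
            return (isset($m[1]) && $m[1] !== '') ? $m[1] : ' ';
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Because `<script>` blocks are skipped entirely, `//` comments inside them keep their line breaks, so the rest of the script is not commented out. The trade-off is that JavaScript and CSS are not minified at all with this approach.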
Using DOMDocument:
$doc = new DOMDocument('5', 'UTF-8');
$doc->loadHTML($oldHtml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment) {
    # Also strips conditional comments for IE... uncool.
    $comment->parentNode->removeChild($comment);
}
$newHtml = '<!DOCTYPE html>'; # Do I really need to do this manually?
$newHtml .= $doc->saveHTML($xpath->query('//html')->item(0));
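For the DOMDocument approach, the XPath query could perhaps be narrowed so that conditional comments survive, and calling `saveHTML()` without a node argument should serialize the whole document including its doctype. The sketch below assumes that conditional comments can be recognized by their text starting with `[if` or containing `[endif]`; that heuristic and the helper name `stripNonConditionalComments()` are assumptions, not something DOMDocument provides.

```php
<?php
// Hypothetical helper: removes ordinary comments but keeps IE conditional
// comments, identified here by a simple text heuristic (an assumption).
function stripNonConditionalComments(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Select comments whose text neither starts with "[if" nor contains "[endif]".
    $query = "//comment()[not(starts-with(normalize-space(.), '[if'))"
           . " and not(contains(., '[endif]'))]";

    // Copy the node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query($query)) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    // With no node argument, the whole document is serialized,
    // including the doctype if the document has one.
    return $doc->saveHTML();
}
```

This still leaves line breaks and JavaScript comments in place, since DOMDocument treats script content as opaque text, so it only addresses the conditional-comment and doctype parts of the problem.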