dongshuo1856 2015-05-28 10:04

PHP - Strip comments and redundant whitespace - best practice

I'd like to strip all comments and redundant whitespace (including line breaks) from an HTML document via PHP.

I tried regular expressions, but they seem poorly suited to parsing an HTML document. I also tried DOMDocument, but it also strips conditional comments for IE, which is definitely unwanted. On top of that, it strips neither line breaks nor JavaScript comments, and its output does not seem to include the doctype.

The goal is to end up with the fewest bytes that still parse as the same HTML document.

My current approaches look like this:

Using regular expressions:

# Works quite well, but would also strip strings that look like comments.
$newHtml = preg_replace('/<!--\s*(?!\[\s*if\s|<\s*!\s*\[\s*endif\s*\]).*?-->/is', '', $oldHtml);

# Works, but would also strip intentional whitespace within <pre> elements
$newHtml = preg_replace('/\s+/', ' ', $oldHtml);

# Has one major side effect: JavaScript comments with double slashes (//)
# will lead to the rest of the script being commented out as well.
$newHtml = preg_replace('/\r|\n/', '', $oldHtml);
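
For illustration only, the three regex passes above could be folded into a single pass that skips <pre>, <textarea>, <script> and <style> blocks and keeps IE conditional comments. This is a rough sketch, not a full HTML parser: the function name minifyHtmlSketch is made up for this example, it still collapses whitespace inside attribute values, and it leaves JavaScript comments and script line breaks untouched.

```php
<?php
// Hypothetical helper (not part of the original post): one regex pass that
// drops ordinary comments and collapses whitespace runs, while leaving
// protected blocks and IE conditional comments untouched.
function minifyHtmlSketch(string $html): string
{
    // Alternative 1: a whole <pre>/<textarea>/<script>/<style> block (groups 1 and 2).
    // Alternative 2: an HTML comment (group 3).
    // Alternative 3: a run of whitespace (no group).
    $pattern = '~(<(pre|textarea|script|style)\b[^>]*>.*?</\2\s*>)|(<!--.*?-->)|\s+~is';

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1]) && $m[1] !== '') {
            // Protected block: return it unchanged, line breaks included.
            return $m[0];
        }
        if (isset($m[3]) && $m[3] !== '') {
            // Keep conditional comments such as <!--[if IE]> and <!--<![endif]-->.
            $isConditional = preg_match('/^<!--\s*(\[\s*if\s|<\s*!\s*\[\s*endif\s*\])/i', $m[3]) === 1;
            return $isConditional ? $m[3] : '';
        }
        // Plain whitespace run: collapse it to a single space.
        return ' ';
    }, $html);
}
```

Because the protected-block alternative is tried first at each position, comment-like strings and // comments inside a <script> element are never touched, which avoids the side effect of the line-break removal above.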

Using DOMDocument:

$doc   = new DOMDocument('5', 'UTF-8');
$doc->loadHTML($oldHtml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment) {
    # Also strips conditional comments for IE... uncool.
    $comment->parentNode->removeChild($comment);
}
$newHtml  = '<!DOCTYPE html>'; # Do I really need to do this manually?
$newHtml .= $doc->saveHTML($xpath->query('//html')->item(0));
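
A possible variation on the DOMDocument approach, as a sketch under a few assumptions: it needs the DOM extension, the helper name stripCommentsWithDom is invented for this example, and it only addresses the conditional comments and the doctype. Calling saveHTML() without a node argument serializes the whole document, including a doctype that was present in the input, so it should not have to be prepended by hand. Line breaks added by the serializer and JavaScript comments remain.

```php
<?php
// Hypothetical helper (not part of the original post): remove ordinary
// comments via DOMDocument but keep IE conditional comments and the doctype.
function stripCommentsWithDom(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings (e.g. about unknown HTML5 tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        // The comment text starts with "[if " or "<![endif]" for conditional comments.
        if (preg_match('/^\s*(\[\s*if\s|<\s*!\s*\[\s*endif\s*\])/i', $comment->nodeValue) === 1) {
            continue;
        }
        $comment->parentNode->removeChild($comment);
    }

    // No node argument: the whole document is serialized, doctype included.
    return $doc->saveHTML();
}
```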