PHP - Stripping comments and redundant whitespace - best practice

I'd like to strip all comments and redundant whitespace (including line breaks) from an HTML document using PHP.

I tried regular expressions, but they are poorly suited to parsing an HTML document. I also tried DOMDocument, but my approach there also strips the conditional comments for IE, which is definitely unwanted. In addition, it strips neither line breaks nor JavaScript comments, and it does not seem to include the doctype in its output.

The goal is to output the fewest bytes needed for the HTML document to still parse correctly.

My current approaches look like this:

Using regular expressions:

# Works quite well, but would also strip strings that look like comments.
$newHtml = preg_replace('/<!--\s*(?!\[\s*if\s|<\s*!\s*\[\s*endif\s*\]).*?-->/is', '', $oldHtml);

# Works, but would also collapse intended whitespace within <pre> elements.
$newHtml = preg_replace('/\s+/', ' ', $oldHtml);

# Has one major side effect: a JavaScript comment starting with double slashes (//)
# will comment out the rest of the script once the line breaks are gone.
$newHtml = preg_replace('/\r|\n/', '', $oldHtml);
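
One possible way around the <pre> and JavaScript side effects noted in the comments above is to let the same regular expression match either a protected element or a run of whitespace, and to replace only the latter. The sketch below is not a full HTML parser and rests on several assumptions: the function name collapseWhitespaceOutsideProtected is hypothetical, the list of protected elements (pre, textarea, script, style) is my own choice, and protected elements are assumed not to be nested or to appear inside comments or attribute values.

```php
<?php
// Hypothetical helper, not from the original post: collapses whitespace runs
// to a single space, except inside <pre>, <textarea>, <script> and <style>.
function collapseWhitespaceOutsideProtected(string $html): string
{
    // Alternative 1 (group 1) matches a whole protected element.
    // Alternative 2 matches a run of whitespace anywhere else.
    $pattern = '#(<(pre|textarea|script|style)\b[^>]*>.*?</\2\s*>)|\s+#is';

    $result = preg_replace_callback(
        $pattern,
        static function (array $m): string {
            // Group 1 is only present when a protected element matched:
            // return it unchanged. Otherwise a whitespace run matched.
            return (isset($m[1]) && $m[1] !== '') ? $m[1] : ' ';
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error
    // (e.g. backtrack limit); fall back to the unmodified input.
    return $result ?? $html;
}
```

Because script blocks are left untouched, their line breaks survive, so a // comment no longer swallows the rest of the script. The price is that nothing inside those blocks gets smaller; shrinking them safely would need a real JavaScript or CSS minifier.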

Using DOMDocument:

$doc   = new DOMDocument('5', 'UTF-8');
$doc->loadHTML($oldHtml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment) {
    # Also strips conditional comments for IE... uncool.
    $comment->parentNode->removeChild($comment);
}
$newHtml  = '<!DOCTYPE html>'; # Do I really need to do this manually?
$newHtml .= $doc->saveHTML($xpath->query('//html')->item(0));
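
The two DOMDocument complaints above might be addressed as follows, shown as a sketch rather than a verified solution. It assumes that conditional comments can be recognised by their text starting with "[if" or containing "[endif]", which mirrors the regular expression earlier in this post, and that the input already carries a doctype. The function name stripNonConditionalComments is hypothetical. Calling saveHTML() without a node argument serialises the whole document, including the doctype that loadHTML() parsed, so it should not need to be prepended by hand.

```php
<?php
// Hypothetical helper, not from the original post: removes ordinary HTML
// comments but keeps IE conditional comments, and serialises the whole
// document so that the parsed doctype is part of the output.
function stripNonConditionalComments(string $html): string
{
    // Collect parser warnings instead of emitting them (HTML5 tags can trigger some).
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Note: without a charset declaration, loadHTML() assumes ISO-8859-1.
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select only comments that do not look like conditional comments.
    $query = "//comment()[not(starts-with(normalize-space(.), '[if'))"
           . " and not(contains(., '[endif]'))]";

    foreach ($xpath->query($query) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    // No node argument: the full document, doctype included, is returned.
    return $doc->saveHTML();
}
```

This still leaves line breaks and JavaScript comments alone, and DOMDocument may normalise the markup in other ways, so the result should be compared against the input before relying on it.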
