Extracting information from HTML using PHP (XPath), PHP/Python (regex), or Python (XPath)

I have more than 40,000 HTML documents that I need to extract information from. I tried PHP + Tidy (because most files are not well-formed) + DOMDocument + XPath, but it is extremely slow. I have been advised to use regular expressions, but the HTML files are not marked up semantically (table-based layout, with meaningless tags and classes used everywhere), and I don't know where to start.

Out of curiosity, is using regular expressions (PHP/Python) faster than using Python's XPath library? Is the XPath library for Python generally faster than PHP's counterpart?


3 answers

dongsheng1238 2009-10-12 13:10
If speed is a requirement, have a look at lxml. lxml is a Pythonic binding for the libxml2 and libxslt C libraries. Using the C libraries is much faster than any pure PHP or Python version.
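As a rough illustration of the lxml approach (a minimal sketch, not a drop-in solution): the sample markup, the two-column row layout, and the `extract_pairs` helper below are all assumptions made up for this example. It relies on `lxml.html.fromstring`, whose underlying libxml2 HTML parser recovers from unclosed tags, so a separate Tidy pass may not be needed.

```python
from lxml import html

# Hypothetical sample standing in for one of the table-based, not
# well-formed files: the <td> and <tr> tags are never closed.
SAMPLE = """
<html><body>
<table class="x1">
  <tr><td class="a">Name<td class="b">Widget
  <tr><td class="a">Price<td class="b">9.99
</table>
</body></html>
"""


def extract_pairs(markup):
    """Return {label: value} for every two-cell table row (hypothetical helper)."""
    # The HTML parser repairs the missing closing tags while building the tree.
    tree = html.fromstring(markup)
    pairs = {}
    for row in tree.xpath("//table//tr"):
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        if len(cells) == 2:
            pairs[cells[0]] = cells[1]
    return pairs


if __name__ == "__main__":
    print(extract_pairs(SAMPLE))
```

With non-semantic markup, positional XPath expressions such as "the second cell of each row" tend to be easier to maintain than expressions keyed on meaningless class names.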

There are some impressive benchmarks from Ian Bicking:

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

Parsing Results:

[Chart: parsing results] http://1.2.3.9/bmi/blog.ianbicking.org/wp-content/uploads/images/parsing-results.png
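Published benchmarks may not reflect your own files, so it can be worth timing both approaches on a representative document. The following is a small sketch using `timeit`; the generated document, the `class="v"` cells, and the function names are hypothetical stand-ins, and the numbers it prints will vary by machine and markup.

```python
import re
import timeit

from lxml import html

# Hypothetical document: 1000 rows in a table-based layout.
DOC = "<html><body><table>" + "".join(
    f'<tr><td class="k">item{i}</td><td class="v">{i}</td></tr>'
    for i in range(1000)
) + "</table></body></html>"

# The regex only works while the markup around the value stays exactly like this.
VALUE_RE = re.compile(r'<td class="v">(.*?)</td>')


def with_regex(markup):
    """Extract the value cells with a precompiled regular expression."""
    return VALUE_RE.findall(markup)


def with_xpath(markup):
    """Parse with lxml, then extract the same cells with XPath."""
    tree = html.fromstring(markup)
    return tree.xpath('//td[@class="v"]/text()')


if __name__ == "__main__":
    # Both approaches should agree before their timings are compared.
    assert with_regex(DOC) == with_xpath(DOC)
    runs = 20
    for fn in (with_regex, with_xpath):
        seconds = timeit.timeit(lambda: fn(DOC), number=runs)
        print(f"{fn.__name__}: {seconds / runs * 1000:.2f} ms per document")
```

A regex can win on raw speed for a fixed pattern like this one, but it does not repair broken markup, so the comparison is only meaningful on documents where both return the same results.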

This answer was selected by the asker as the best answer.

