Extracting information from HTML using PHP (XPath), PHP/Python (regex), or Python (XPath)

I have approx. 40k+ HTML documents from which I need to extract information. I tried PHP + Tidy (because most files are not well-formed) + DOMDocument + XPath, but it is extremely slow. I have been advised to use regexp, but the HTML files are not marked up semantically (table-based layout, with meaningless tags/classes used everywhere) and I don't know where I should start...

Just out of curiosity, is using regexp (PHP/Python) faster than using Python's XPath library? Is the XPath library for Python generally faster than PHP's counterpart?

3 answers



If speed is a requirement, have a look at lxml. lxml is a Pythonic binding for the libxml2 and libxslt C libraries. Using the C libraries is much faster than any pure-PHP or pure-Python parser.

There are some impressive benchmarks from Ian Bicking:

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

Parsing results: http://1.2.3.9/bmi/blog.ianbicking.org/wp-content/uploads/images/parsing-results.png
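
As a rough illustration of the lxml suggestion, here is a minimal sketch that parses deliberately sloppy, table-based markup and pulls label/value pairs out with XPath. The sample markup, the `extract_fields` helper, and the "first cell is the label, second cell is the value" rule are assumptions made for the example; they are not taken from the asker's files.

```python
from lxml import html

# Hypothetical layout-table markup; note the unclosed <td> in the second row.
SAMPLE = """
<html><body>
<table><tr><td><font size=2>Name:</font></td><td>Widget<br></td></tr>
<tr><td><font size=2>Price:</font><td>9.99</td></tr>
</table>
</body></html>
"""


def extract_fields(markup):
    """Map the text of each row's first cell to the text of its second cell."""
    doc = html.fromstring(markup)  # libxml2's HTML parser tolerates broken markup
    fields = {}
    for row in doc.xpath("//tr"):
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        if len(cells) >= 2:
            fields[cells[0].rstrip(":")] = cells[1]
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

For files on disk, `lxml.html.parse(path)` returns a tree whose `getroot()` can be queried with the same XPath expressions, so a separate Tidy pass may not be needed.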



You might give Beautiful Soup in Python a try. It's a pretty great parser for generating a usable DOM out of garbage HTML. That, with some regex skills, might get you what you need. Happy hunting!

In my subjective experience, most comparable operations in Python are faster than in PHP. Partly this is because Python compiles source to bytecode instead of interpreting the source text directly at runtime, and partly because Python has been optimized for greater efficiency by its contributors...

Still, for 40k+ documents, find a nice fast machine ;-)
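
A minimal sketch of the "Beautiful Soup plus some regex" idea, assuming the current `bs4` package and the standard-library `html.parser` backend: since the markup has no meaningful ids or classes, it locates a value by the visible label text next to it. The sample markup and the `value_after_label` helper are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical layout-table markup: no ids, no meaningful class names.
SAMPLE = (
    '<table><tr><td class="x1"><font size="2">Total:</font></td>'
    '<td class="x1">$ 19.50</td></tr></table>'
)


def value_after_label(markup, label_pattern):
    """Return the text of the first <td> that follows a matching label."""
    soup = BeautifulSoup(markup, "html.parser")
    # Find the text node whose content matches the label regex.
    label = soup.find(string=re.compile(label_pattern))
    if label is None:
        return None
    # The value is assumed to sit in the next <td> in document order.
    cell = label.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


if __name__ == "__main__":
    print(value_after_label(SAMPLE, r"Total"))
```

Anchoring on label text rather than on tag structure tends to survive layout changes better, at the cost of depending on the exact wording of the labels.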

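doupai1876 (about 11 years ago): lxml, mentioned elsewhere, also has a BeautifulSoup-like API.
dqwr32867 (about 11 years ago): Thanks for the answer :D The production machine should be about twice as fast as my dev computer :D But it runs way too slow on my machine :/ Will give Beautiful Soup a try soon.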



As the previous post mentions, Python is in general faster than PHP thanks to bytecode compilation (those .pyc files). A lot of DOM/SAX parsers use a fair bit of regexp internally anyway. Those who told you to use regexp need to be told that it is not a magic bullet. For 40k+ documents, I would recommend parallelizing the task using the newer multithreading support or the classic Parallel Python.
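
One way to sketch the parallelization idea is with the standard library's `concurrent.futures`. This is a swapped-in technique, not the Parallel Python package the answer names. The `html_docs` directory, the `extract_one` placeholder (it only grabs the `<title>` with a regex), and the `extract_all` helper are all assumptions for illustration; a real per-file function would do the actual extraction.

```python
import re
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_one(path):
    """Placeholder per-file step: return (file name, <title> text or None)."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    match = TITLE_RE.search(text)
    title = match.group(1).strip() if match else None
    return Path(path).name, title


def extract_all(paths, executor_cls=ProcessPoolExecutor, workers=None):
    """Run extract_one over all paths in a pool and collect results by name."""
    with executor_cls(max_workers=workers) as pool:
        # chunksize batches the work sent to each process; thread pools ignore it.
        return dict(pool.map(extract_one, paths, chunksize=64))


if __name__ == "__main__":
    # "html_docs" is a hypothetical directory holding the documents.
    results = extract_all(sorted(Path("html_docs").glob("*.html")))
    print(len(results), "documents processed")
```

Processes are the default here because parsing is CPU-bound and, in standard CPython, threads running pure-Python code are serialized by the global interpreter lock. Passing `executor_cls=ThreadPoolExecutor` is still handy for quick tests or I/O-heavy steps.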
