How do you parse and process HTML/XML in PHP?

How can one parse HTML/XML and extract information from it?

Reposted from: https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php

local-host 2010-08-26 17:19
Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
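
As a minimal sketch of that combination (broken HTML in, XPath query out), assuming a made-up snippet with deliberately unclosed paragraph tags; the markup and variable names are illustrative only:

```php
<?php
// Hypothetical sample markup: the <p> tags are never closed on purpose.
$html = '<div><p>Intro<p><a href="/one">One</a> and <a href="/two">Two</a></div>';

// Collect libxml's parse complaints internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html); // loadHTML() uses libxml's HTML parser, which tolerates broken markup

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query every href attribute that sits on an <a> element.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attribute) {
    $hrefs[] = $attribute->nodeValue;
}

print_r($hrefs);
```

Loading the same string with loadXML() instead would fail, because the snippet is not well-formed XML.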

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.

A basic usage example can be found in "Grabbing the href attribute of an A element", and a general conceptual overview can be found at "DOMDocument in php".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching or browsing Stack Overflow.

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
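
A rough sketch of that cursor behaviour, assuming a small invented catalog document; the element names are hypothetical:

```php
<?php
// Hypothetical sample document.
$xml = '<catalog>'
     . '<book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book>'
     . '</catalog>';

$reader = new XMLReader();
$reader->XML($xml); // load from a string; open() would read from a file or URL instead

$titles = [];
// read() moves the cursor forward one node at a time and returns false at the end.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        // readString() returns the text content of the node under the cursor.
        $titles[] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```

Only the node under the cursor is held at any moment, which is why this style suits large documents.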

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example can be found at "getting all values from h1 tags using php".

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
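
To give an idea of the push style, here is a small sketch that collects title text through event handlers. The sample document and the state variables are assumptions made for illustration:

```php
<?php
// Hypothetical sample document.
$xml = '<catalog><book><title>First</title></book><book><title>Second</title></book></catalog>';

$titles  = [];
$inTitle = false; // state we have to track ourselves between events

$parser = xml_parser_create();
// Element names are upper-cased by default; turn that off to compare against 'title'.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attributes) use (&$inTitle, &$titles) {
        if ($name === 'title') {
            $inTitle  = true;
            $titles[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

// Called for text; one text node may arrive in several chunks, so append.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$inTitle, &$titles) {
    if ($inTitle) {
        $titles[count($titles) - 1] .= $data;
    }
});

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```

The parser calls into your handlers, so any context (such as "am I inside a title?") has to be tracked by hand, which is what makes this harder to work with than a pull parser.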

SimpleXML

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.
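
For well-formed input, the property and array access looks roughly like this; the sample document and its ids are invented for the sketch:

```php
<?php
// Hypothetical, well-formed sample document.
$xml = '<catalog>'
     . '<book id="b1"><title>First</title></book>'
     . '<book id="b2"><title>Second</title></book>'
     . '</catalog>';

$catalog = simplexml_load_string($xml);
if ($catalog === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

$books = [];
// Child elements are reached as properties, attributes with array syntax.
foreach ($catalog->book as $book) {
    // Values are SimpleXMLElement objects, so cast them to strings explicitly.
    $books[(string) $book['id']] = (string) $book->title;
}

print_r($books);
```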

A basic usage example can be found at "A simple program to CRUD node and node values of xml file", and there are many additional examples in the PHP Manual.

3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

Also see: https://github.com/electrolinux/phpquery

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!

Require PHP 5+.

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrap html, whether it's valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.

Ganon

A universal tokenizer and HTML/XML/RSS DOM Parser

Ability to manipulate elements and their attributes

Supports invalid HTML and UTF8

Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)

A HTML beautifier (like HTML Tidy)

Minify CSS and Javascript

Sort attributes, change character case, correct indentation, etc.

Extensible

Parsing documents using callbacks based on current character/token

Operations separated in smaller functions for easy overriding

Fast and Easy

Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.

Web Services

If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.

YQL

The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience.

ScraperWiki

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
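
To make the brittleness concrete, here is a small comparison. The pattern is a deliberately naive one of the kind found in many snippets, and the three link variants and both helper functions are made up for this sketch:

```php
<?php
// Three hypothetical ways of writing the same link.
$variants = [
    '<a href="/page">link</a>',
    '<a class="nav" href="/page">link</a>', // extra attribute before href
    "<a href = '/page' >link</a>",          // extra whitespace, single quotes
];

// Naive pattern (an assumption for illustration, not a recommended one).
$pattern = '/<a href="([^"]*)">/';

function hrefViaRegex(string $html, string $pattern): ?string
{
    return preg_match($pattern, $html, $matches) === 1 ? $matches[1] : null;
}

function hrefViaDom(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $anchor = $dom->getElementsByTagName('a')->item(0);
    return $anchor instanceof DOMElement ? $anchor->getAttribute('href') : null;
}

$regexResults = [];
$domResults   = [];
foreach ($variants as $html) {
    $regexResults[] = hrefViaRegex($html, $pattern);
    $domResults[]   = hrefViaDom($html);
}

var_dump($regexResults); // only the first variant matches
var_dump($domResults);   // all three variants yield the href
```

A more careful pattern could cover these three cases, but each further variation in the markup needs its own accommodation, while the parser already handles them.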

HTML parsers already know the syntactical rules of HTML. With regular expressions, those rules have to be taught anew in each RegEx you write. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job at this.

Also see Parsing Html The Cthulhu Way

Books

If you want to spend some money, have a look at

PHP Architect's Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.
This answer was selected as the best answer by the asker.
