How can I extract certain HTML tags, e.g. <ul>, using regex with preg_match_all in PHP?

I am new to regular expressions. I want to fetch some data from a web page source. I used file_get_contents("url") to get the page's HTML source. Now I want to capture a portion within some special tags.

I found that preg_match_all() works for this. I would like help solving this problem and, if possible, learning how to approach similar ones.

In the example below, how would I get the data within the <ul>? (I wish this sample HTML code could be easier for me to understand.)

<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
<li>
    <div class="b_b">bbbbb <span class="s-s">bbbb</span> bbbb</div>
</li>
<li>
    <div class="c_c d-d">cccc cccc ccccc</div>
</li>
</ul>
<table>
<tr>
    <td>sdsdf</td>
    <td>hjhjhj</td>
</tr>
<tr>
    <td>yuyuy</td>
    <td>ertre</td>
</tr>   
</table>

douchilian1009 2014-01-07 09:38
As the comments stated already, it's generally not recommended to parse html with regex. In my opinion, it depends on what exactly you're going to do.
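For comparison, here is a minimal sketch of the parser-based route that the comments allude to. It assumes PHP's bundled DOM extension (DOMDocument) is available; the input string and the variable names are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical, shortened input modelled on the question's sample HTML.
$html = '<div class="a_a">qqqqq</div><ul><li>aaaa</li><li>bbbb</li></ul><table><tr><td>x</td></tr></table>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// real-world markup is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Walk every <ul> and collect the text of its <li> children.
$items = [];
foreach ($doc->getElementsByTagName('ul') as $ul) {
    foreach ($ul->getElementsByTagName('li') as $li) {
        $items[] = trim($li->textContent);
    }
}

print_r($items);
```

Unlike a regex, this approach should not care about attributes on the tag or about nesting, which is why it is usually suggested for anything beyond a quick one-off extraction.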

If you want to use regex and know that there are no nested tags of the same kind, the simplest pattern for getting everything between <ul> and the closest </ul> would be:

$pattern = '~<ul>(.*?)</ul>~s';

It matches <ul> followed by as few characters of any kind as possible until </ul> is reached. The dot is a metacharacter that matches any single character except a newline (\n). To make it match newlines too, I put the s modifier after the ending delimiter ~. The quantifier * means zero or more times.

By default, quantifiers are greedy, which means they consume as much as possible. A question mark ? after the * makes the quantifier non-greedy (lazy), so it matches as few characters as possible before </ul>. As the pattern delimiter I chose the tilde ~.
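The difference between greedy and lazy matching is easiest to see side by side. The following is a small sketch; the input string is a made-up example, not taken from the question.

```php
<?php
// Hypothetical input with two separate lists.
$html = '<ul><li>one</li></ul><p>gap</p><ul><li>two</li></ul>';

// Greedy: .* runs to the LAST </ul>, so both lists end up in one match.
preg_match_all('~<ul>(.*)</ul>~s', $html, $greedy);

// Lazy: .*? stops at the FIRST </ul> it can reach, giving one match per list.
preg_match_all('~<ul>(.*?)</ul>~s', $html, $lazy);

print_r($greedy[1]); // one element spanning both lists
print_r($lazy[1]);   // two elements: '<li>one</li>' and '<li>two</li>'
```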

preg_match_all($pattern, $html, $out);

Matches are stored in the output variable that you pass to preg_match or preg_match_all, where [0] contains everything that matches the whole pattern, [1] the first parenthesized subpattern, and so on.

If the tag you are searching for can contain attributes (e.g. <ul class="my_list"...), this extended pattern adds [^>]* after <ul, which matches any number of characters that are not > before the closing >:
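Applied to a shortened copy of the sample HTML from the question, the call might look like the sketch below. The variable names $count and $inner are only illustrative.

```php
<?php
// Shortened copy of the sample HTML from the question.
$html = <<<HTML
<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
<li>
    <div class="b_b">bbbbb <span class="s-s">bbbb</span> bbbb</div>
</li>
</ul>
<table><tr><td>sdsdf</td></tr></table>
HTML;

$pattern = '~<ul>(.*?)</ul>~s';
$count = preg_match_all($pattern, $html, $out);

// $out[0][0] is the whole match, including the <ul> and </ul> tags;
// $out[1][0] is only what the parentheses captured.
$inner = trim($out[1][0]);

echo $count, " match(es)\n", $inner, "\n";
```

Here $out[1][0] holds the two <li> blocks with their surrounding newlines, which is why the sketch trims it before printing.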

$pattern = '~<ul[^>]*>\K.*(?=</ul>)~Uis';

Instead of the question mark, here I use the U modifier to make all quantifiers lazy. To report only the desired part, the content between <ul> and </ul>, \K resets the beginning of the reported match. Instead of consuming the closing </ul>, a lookahead (?=...) is used, as we do not want that part in the output either.

This is basically the same as '~<ul[^>]*>(.*)</ul>~Uis' which would capture whole-pattern matches to [0] and first parenthesized group to [1].
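A quick way to check that the two forms agree is to run both on the same input. This is a sketch with a made-up input (an uppercase tag with an attribute); $withK and $withGroup are illustrative names.

```php
<?php
// Hypothetical input: uppercase tag name, an attribute, content on its own line.
$html = "<UL class=\"my_list\">\n<li>x</li>\n</UL>";

$withK     = '~<ul[^>]*>\K.*(?=</ul>)~Uis'; // inner content reported as the whole match
$withGroup = '~<ul[^>]*>(.*)</ul>~Uis';     // inner content captured in group 1

preg_match_all($withK, $html, $a);
preg_match_all($withGroup, $html, $b);

// With \K the inner content sits in [0]; with the group it sits in [1].
var_dump($a[0][0] === $b[1][0]); // bool(true)
```

The i modifier is what lets the lowercase pattern match the uppercase <UL> here, and the s modifier lets the dot cross the line breaks.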

If your HTML contains nested tags of the same kind, the idea of the following pattern is to catch the innermost ones. At each character inside <ul>...</ul>, it checks that no opening <ul starts there:

$pattern = '~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis';

Get matches using preg_match_all

$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div> <ul><li>.2.</li></ul>';
if (preg_match_all($pattern, $html, $out)) {
    echo "<pre>";
    print_r(array_map('htmlspecialchars', $out[0]));
    echo "</pre>";
} else {
    echo "FAIL";
}

The text matched between \K and (?= is reported in $out[0].

\K resets beginning of the reported match (supported in PHP since 5.2.4)

The last pattern, once <ul> has matched, uses the negative lookahead (?!<ul) at each character to check that no opening <ul starts before </ul> is reached. If one does, the attempt fails and matching starts over further along, until </ul> is ahead (?=</ul>).

[^>]* any amount of characters, that are not > (negated character class)

(?: starts a non-capturing group.

Used Modifiers: Uis (part after the ending delimiter ~)

U (PCRE_UNGREEDY), i (PCRE_CASELESS), s (PCRE_DOTALL)
Question tags: html, php. Asked 2014-01-07 06:42.
