dongpao1921 2011-03-02 21:46

preg_match_all - greedy part of a regular expression, but maximizing the number of matches

I have the following HTML to parse:

<h1 class="x">test</h1>
<p>some text <img src="x" /></p>

<h1 class="x1">test2</h1>
<p>some text </p>

<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>

Can I parse this into an array with a single regular expression?

I tried

preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism',$html,$arr);

which gives me only one entry, because the last part of the regex is greedy, and

preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism',$html,$arr);

which gives me none of the HTML between the <h1> elements, because the final group is lazy and matches the empty string.
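
For illustration, the two attempts can be run side by side on the sample HTML. This is only a minimal sketch: the variable names $greedy and $lazy are made up for the example and do not come from the original post.

```php
<?php
// Sample input copied from the question.
$html = <<<'HTML'
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>

<h1 class="x1">test2</h1>
<p>some text </p>

<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
HTML;

// Attempt 1: greedy trailing group. With the "s" flag, (.*) runs to the
// end of the input, so the first match swallows the other two headings.
preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism', $html, $greedy);

// Attempt 2: lazy trailing group. Nothing follows (.*?) in the pattern,
// so it is satisfied by the empty string and captures nothing.
preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism', $html, $lazy);

echo count($greedy[0]), "\n"; // 1
echo count($lazy[0]), "\n";   // 3
var_dump($lazy[4]);           // three empty strings
```

So the greedy version finds one match instead of three, and the lazy version finds all three headings but with an empty fourth group.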

How can I make the part after the closing </h1> match greedily, while at the same time matching as many occurrences as possible?

Additional comments:

  • The question is fairly academic. I have resolved the problem using preg_split, and a variety of other methods would work, but they may also have downsides (for example, DOM may not work on invalid HTML that I cannot control). However, it is a recurring problem that I'd be interested to know more about.

2 answers

  • dpquu9206 2011-03-02 21:59

    You need some form of end marker. The regex cannot guess where you want the match to stop.

    One possibility in this case is a lookahead assertion after the (.*?) at the end:

    (?=<h1|</body>|\z)#ims
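
Put together with the pattern from the question, the lookahead might be used roughly as follows. This is a sketch under assumptions: the PREG_SET_ORDER flag, the $sections array and its 'title' and 'body' keys are illustrative choices, not something the answer prescribes.

```php
<?php
// Sample input copied from the question.
$html = <<<'HTML'
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>

<h1 class="x1">test2</h1>
<p>some text </p>

<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
HTML;

// The lazy (.*?) now has to expand until the lookahead succeeds:
// at the next <h1, at </body>, or at the very end of the input (\z).
$pattern = '#(<h1[^>]*?>)(.*?)(</h1>)(.*?)(?=<h1|</body>|\z)#ims';

// PREG_SET_ORDER groups the captures per match, which is easier to loop over.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Hypothetical result structure: one entry per heading.
$sections = [];
foreach ($matches as $m) {
    $sections[] = [
        'title' => $m[2],        // text inside the <h1>
        'body'  => trim($m[4]),  // everything up to the next end marker
    ];
}

print_r($sections);
```

Because the lookahead does not consume the next <h1, each following match can start there, so all three headings are found and each one carries its own body.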
    
    This answer was accepted by the asker.