dongpao1921 2011-03-02 21:46

preg_match_all - greedy part of a regular expression, but maximizing the number of matches

I have the following html to parse:

<h1 class="x">test</h1>
<p>some text <img src="x" /></p>

<h1 class="x1">test2</h1>
<p>some text </p>

<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>

Can I parse this into an array with a single regular expression?

I tried

preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism',$html,$arr);

which gives me only one entry, because the last part of the regex is greedy, and

preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism',$html,$arr);

which gives me nothing of the HTML between the <h1>, because the expression is not greedy.
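
To make the two behaviours concrete, here is a minimal sketch that runs both patterns against the sample HTML above and counts the matches. The variable names ($greedy, $lazy, $a, $b) are only illustrative.

```php
<?php
// Sample input copied from the HTML above.
$html = <<<'HTML'
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>

<h1 class="x1">test2</h1>
<p>some text </p>

<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
HTML;

// Greedy tail: with the "s" flag, (.*) runs to the end of the input,
// so the first match swallows all remaining headings.
$greedy = preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism', $html, $a);

// Lazy tail: (.*?) is satisfied by the empty string, so every heading
// is matched but group 4 is always empty.
$lazy = preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism', $html, $b);

echo "greedy: $greedy match(es)\n";
echo "lazy: $lazy match(es), group 4 = ", var_export($b[4], true), "\n";
```

The greedy pattern reports one match and the lazy pattern reports three, each with an empty fourth group.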

How can I make the part after the </h1> match greedily, while at the same time matching as many occurrences as possible?

Additional comments:

  • The question is fairly academic: I have solved the problem using preg_split, and a variety of other methods would work but may also have downsides (for example, DOM may not work on invalid HTML that I cannot control). However, it is a recurring problem that I'd like to know more about.

2 answers

  • dpquu9206 2011-03-02 21:59

    You need some form of end marker. The regex cannot guess how far you want each match to extend.

    In this case, one option is a lookahead assertion after the (.*?) at the end:

    (?=<h1|</body>|\z)#ims
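
    As a rough sketch of how that fragment fits into the pattern from the question (the sample $html and the names $pattern, $sections and $count are assumptions for illustration, not part of the original answer):

```php
<?php
// Sample input copied from the question.
$html = <<<'HTML'
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>

<h1 class="x1">test2</h1>
<p>some text </p>

<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
HTML;

// The lazy (.*?) keeps expanding until the lookahead succeeds, i.e. until
// the next <h1, a closing </body>, or the very end of the input (\z).
// The lookahead consumes nothing, so the next match can start at that <h1.
$pattern = '#(<h1[^>]*?>)(.*?)(</h1>)(.*?)(?=<h1|</body>|\z)#ims';

$count = preg_match_all($pattern, $html, $sections, PREG_SET_ORDER);

foreach ($sections as $s) {
    // $s[2] is the heading text, $s[4] is the HTML that follows the heading.
    printf("%s => %s\n", $s[2], trim($s[4]));
}
```

    With the sample input this yields three entries, one per heading, each carrying the markup up to the next <h1> (or to the end of the string for the last one).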
    
    Accepted by the asker as the best answer.