I have the following HTML to parse:
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>
<h1 class="x1">test2</h1>
<p>some text </p>
<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
Can I parse this into an array with a single regular expression?
I tried
preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism',$html,$arr);
which gives me only one entry, because the last part of the regex is greedy, and
preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism',$html,$arr);
which gives me none of the HTML between the <h1> elements, because the expression is not greedy.
How can I make the part after the </h1> match greedily, while at the same time matching as many occurrences as possible?
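To make the goal concrete, here is a minimal sketch of one direction that might work: keep the trailing group lazy, but bound it with a lookahead so it has to run up to the next <h1 or the end of the input. The sample markup and variable names are the ones from above. The lookahead and the PREG_SET_ORDER flag are my own additions, not something I have verified against messy real-world HTML.

```php
<?php
$html = <<<HTML
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>
<h1 class="x1">test2</h1>
<p>some text </p>
<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
HTML;

// Group 4 stays lazy, but the lookahead forces it to extend
// until the next <h1 tag or the very end of the string (\z).
$pattern = '#(<h1[^>]*>)(.*?)(</h1>)(.*?)(?=<h1\b|\z)#is';

// PREG_SET_ORDER gives one array per match instead of one array per group.
$count = preg_match_all($pattern, $html, $arr, PREG_SET_ORDER);

foreach ($arr as $m) {
    // $m[2] is the heading text, $m[4] is the markup up to the next heading.
    echo $m[2], ' => ', trim($m[4]), PHP_EOL;
}
```

With the sample above this should yield three matches, one per heading. It assumes that the string <h1 never occurs inside the content that follows a heading.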
Additional comments:
- The question is fairly academic: I have solved the problem using preg_split, and a variety of other methods would work, though they may also have downsides (for example, DOM may not work on invalid HTML that I cannot control). However, it is a recurring problem that I'd be interested to know more about.
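For reference, the preg_split workaround can be sketched roughly as follows. This is an illustrative reconstruction under assumptions, not the exact code in use: the shortened sample string and the $parts and $sections names are hypothetical.

```php
<?php
$html = '<h1 class="x">test</h1><p>some text <img src="x" /></p>'
      . '<h1 class="x1">test2</h1><p>some text </p>';

// Split on whole <h1>...</h1> elements. PREG_SPLIT_DELIM_CAPTURE keeps the
// captured heading text in the result; PREG_SPLIT_NO_EMPTY drops the empty
// string that would otherwise appear before the first heading.
$parts = preg_split(
    '#<h1[^>]*>(.*?)</h1>#is',
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// $parts now alternates: heading, body, heading, body, ...
$sections = [];
for ($i = 0; $i + 1 < count($parts); $i += 2) {
    $sections[$parts[$i]] = $parts[$i + 1];
}
```

One downside of this sketch: PREG_SPLIT_NO_EMPTY also drops an empty heading or an empty body, which would break the heading/body alternation, so it only holds when every heading has text and is followed by some content.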