As the comments already stated, parsing HTML with regex is generally not recommended. In my opinion, it depends on what exactly you're going to do.
If you want to use regex and know that there are no nested tags of the same kind, the simplest pattern for getting everything between <ul> and the closest </ul> would be:
$pattern = '~<ul>(.*?)</ul>~s';
It matches <ul> followed by as few characters of any kind as possible until </ul> is met. The dot is a metacharacter that matches any single character except newlines (\n). To make it match newlines too, I put the s-modifier after the ending delimiter ~. The quantifier * means zero or more times.
By default, quantifiers are greedy, which means they eat up as much as possible. A question mark ? after the * makes it non-greedy (or lazy), so it matches as few characters as possible to reach </ul>. As pattern delimiter I chose the tilde ~.
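To make the greedy/lazy difference concrete, here is a minimal sketch. The sample input and the variable names $greedy and $lazy are made up for illustration:

```php
<?php
// Hypothetical sample input with two separate, non-nested lists.
$html = '<ul><li>a</li></ul><p>x</p><ul><li>b</li></ul>';

// Greedy: .* runs on to the LAST </ul>, so both lists land in one match.
preg_match_all('~<ul>(.*)</ul>~s', $html, $greedy);

// Lazy: .*? stops at the CLOSEST </ul>, giving one match per list.
preg_match_all('~<ul>(.*?)</ul>~s', $html, $lazy);

print_r($greedy[1]); // one element spanning both lists
print_r($lazy[1]);   // two elements: <li>a</li> and <li>b</li>
```

With the greedy version everything from the first <ul> to the last </ul> is swallowed, which is rarely what you want here.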
preg_match_all($pattern, $html, $out);
Matches are captured and can be found in the output variable that you pass to preg_match or preg_match_all, where [0] contains everything that matches the whole pattern, [1] the first captured parenthesized subpattern, ...
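As a small sketch of what ends up where, and of what the s-modifier changes, assuming a made-up input that contains newlines (the names $withS, $withoutS and $none are only illustrative):

```php
<?php
// Hypothetical input whose list content spans several lines.
$html = "<ul>\n<li>one</li>\n</ul>";

// With s, the dot also matches newlines, so the list is found.
$withS = preg_match_all('~<ul>(.*?)</ul>~s', $html, $out);

// Without s, the dot stops at the first newline and nothing matches.
$withoutS = preg_match_all('~<ul>(.*?)</ul>~', $html, $none);

var_dump($withS);     // number of full matches found with s
var_dump($out[0][0]); // whole match, including <ul> and </ul>
var_dump($out[1][0]); // group 1: only the part between the tags
var_dump($withoutS);  // number of full matches found without s
```

preg_match_all returns the number of full pattern matches, so the return value is a convenient check before touching $out.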
If the tag you're searching for can contain attributes (e.g. <ul class="my_list"...), this extended pattern also includes [^>]* after <ul, which matches any amount of characters that are not > before meeting >:
$pattern = '~<ul[^>]*>\K.*(?=</ul>)~Uis';
Instead of the question mark, here I use the U-modifier to make all quantifiers lazy. To get only the desired part, the content between <ul> and </ul>, \K is used to reset the beginning of the reported match. Instead of matching the ending </ul>, a lookahead (?= is used, as we don't want that part in the output either.
This is basically the same as '~<ul[^>]*>(.*)</ul>~Uis', which would capture whole-pattern matches to [0] and the first parenthesized group to [1].
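A quick way to convince yourself that the two forms give the same inner content is to run both on the same input. This is only a sketch; the sample HTML and the names $viaK and $viaGroup are assumptions for illustration:

```php
<?php
// Hypothetical input: attributes and mixed-case tags.
$html = '<ul class="my_list"><li>a</li></ul><UL id="x"><li>b</li></UL>';

// \K drops the opening tag from the reported match, and the
// lookahead keeps </ul> out of it, so [0] holds the inner content.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $viaK);

// Capture-group form: [0] holds the full element, [1] the inner content.
preg_match_all('~<ul[^>]*>(.*)</ul>~Uis', $html, $viaGroup);

print_r($viaK[0]);     // inner content only
print_r($viaGroup[0]); // full elements including the tags
print_r($viaGroup[1]); // inner content, same as $viaK[0]
```

Which form to pick is mostly a matter of taste: the \K variant gives a flat result in [0], while the group variant also keeps the full element around.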
But if your HTML contains nested tags of the same kind, the idea of the following pattern is to catch the innermost ones. At each character inside <ul>...</ul>, it checks that no opening <ul starts there:
$pattern = '~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis';
Get matches using preg_match_all:
$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div>
<ul><li>.2.</li></ul>';
if (preg_match_all($pattern, $html, $out)) {
    // Escape the matches so a browser shows the tags instead of rendering them.
    echo "<pre>";
    print_r(array_map('htmlspecialchars', $out[0]));
    echo "</pre>";
} else {
    echo "FAIL";
}
Matches between \K and (?= will end up in $out[0].
- \K resets the beginning of the reported match (supported in PHP since 5.2.4).
- The nested-tag pattern, once <ul> has matched, uses the negative lookahead (?!<ul) at each character to check that no opening <ul starts there before </ul> is met. If one does, the attempt fails and matching starts over at a later <ul, continuing until </ul> is ahead (?=</ul>).
- [^>]* matches any amount of characters that are not > (negated character class).
- (?: starts a non-capturing group.
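To see what the negative lookahead buys you, here is a rough sketch that runs the attribute-aware pattern and the innermost pattern over the same nested sample, written on one line for simplicity (the names $flat and $inner are only for illustration):

```php
<?php
// Nested sample: an outer list containing an inner list, then a second list.
$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div><ul><li>.2.</li></ul>';

// Without the lookahead guard, the match starts at the OUTER <ul>
// and stops at the first </ul>, cutting through the inner list.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $flat);

// With (?!<ul) checked at every character, a match may not run across
// another opening <ul, so only the innermost lists are reported.
preg_match_all('~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis', $html, $inner);

print_r($flat[0]);  // first element is a broken fragment of the outer list
print_r($inner[0]); // only the innermost contents
```

Note that the content of the outer list is not returned by either pattern as a whole; getting properly balanced nesting needs a different approach.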
Used modifiers: Uis (the part after the ending delimiter ~): U (PCRE_UNGREEDY), i (PCRE_CASELESS), s (PCRE_DOTALL)
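The effect of each modifier can be checked with a small experiment. This is a sketch on a made-up input, and the variable names are hypothetical:

```php
<?php
// Hypothetical input: an upper-case list with newlines, then a lower-case one.
$html = "<UL>\n<li>a</li>\n</UL><ul><li>b</li></ul>";

// No modifiers: case-sensitive, dot stops at newlines, quantifiers greedy.
preg_match_all('~<ul>.*</ul>~', $html, $plain);

// i + s: <UL> matches too and the dot crosses newlines, but the
// greedy .* now runs from the first opening tag to the last closing tag.
preg_match_all('~<ul>.*</ul>~is', $html, $is);

// U + i + s: quantifiers become lazy, so each list is matched separately.
preg_match_all('~<ul>.*</ul>~Uis', $html, $uis);

// Under U the meaning of ? is inverted: .*? is greedy again.
preg_match_all('~<ul>.*?</ul>~Uis', $html, $flipped);

print_r($plain[0]);
print_r($is[0]);
print_r($uis[0]);
print_r($flipped[0]);
```

Because U flips the meaning of the trailing ?, mixing the U-modifier with explicit lazy quantifiers in one pattern is easy to get wrong; it is probably safer to stick to one style per pattern.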