dongya2578 2014-01-07 06:42

How do I extract certain HTML tags, e.g. <ul>, using a regex with preg_match_all in PHP?

I am new to regular expressions. I want to fetch some data from a web page source. I used file_get_contents("url") to get the page's HTML source. Now I want to capture a portion within some special tags.

I found that preg_match_all() works for this. I'd like some help solving my problem and, if possible, some pointers on how to solve similar problems.

In the example below, how would I get the data within the <ul>? (I hope this sample HTML makes the question easier to understand.)

<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
<li>
    <div class="b_b">bbbbb <span class="s-s">bbbb</span> bbbb</div>
</li>
<li>
    <div class="c_c d-d">cccc cccc ccccc</div>
</li>
</ul>
<table>
<tr>
    <td>sdsdf</td>
    <td>hjhjhj</td>
</tr>
<tr>
    <td>yuyuy</td>
    <td>ertre</td>
</tr>   
</table>

2 answers

  • douchilian1009 2014-01-07 09:38

    As the comments already stated, it's generally not recommended to parse HTML with regex. In my opinion, it depends on what exactly you're going to do.
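
    For comparison, here is a minimal sketch of the parser-based route the comments refer to, assuming PHP's bundled DOM extension (DOMDocument) is available. The sample string is a shortened copy of the question's HTML, and the variable names are only illustrative.

```php
<?php
// Sample input, shortened from the HTML in the question.
$html = '<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li><div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div></li>
<li><div class="b_b">bbbbb <span class="s-s">bbbb</span> bbbb</div></li>
</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$items = [];
foreach ($doc->getElementsByTagName('ul') as $ul) {
    foreach ($ul->getElementsByTagName('li') as $li) {
        // Collapse runs of whitespace so the text is easier to compare.
        $items[] = trim(preg_replace('~\s+~', ' ', $li->textContent));
    }
}
print_r($items);
```

    A parser copes with nesting and attributes on its own, which is why it is usually suggested first. The regex patterns below remain an option for simple, well-known input.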


    If you want to use regex and know that there are no nested tags of the same kind, the simplest pattern for getting everything between <ul> and the closest </ul> would be:

    $pattern = '~<ul>(.*?)</ul>~s';
    

    It matches <ul> followed by as few characters of any kind as possible until </ul> is reached. The dot is a metacharacter that matches any single character except newlines (\n). To make it match newlines too, I put the s modifier after the ending delimiter ~. The quantifier * means zero or more times.

    By default, quantifiers are greedy, which means they consume as much as possible. A question mark ? after the * makes the quantifier non-greedy (or lazy), so it matches as few characters as possible before </ul>. As the pattern delimiter I chose the tilde ~.
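
    The difference between greedy and lazy matching can be seen in a short sketch. The input string is made up for illustration and is not taken from the question.

```php
<?php
// Hypothetical input with two separate lists on one line.
$html = '<ul>one</ul><p>gap</p><ul>two</ul>';

// Greedy: .* runs to the LAST </ul>, so both lists end up in one match.
preg_match_all('~<ul>(.*)</ul>~s', $html, $greedy);

// Lazy: .*? stops at the FIRST </ul> it can reach, giving one match per list.
preg_match_all('~<ul>(.*?)</ul>~s', $html, $lazy);

print_r($greedy[1]); // one element: "one</ul><p>gap</p><ul>two"
print_r($lazy[1]);   // two elements: "one" and "two"
```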

    preg_match_all($pattern, $html, $out);
    

    Matches are stored in the output variable that you pass to preg_match or preg_match_all, where [0] contains everything that matches the whole pattern, [1] the first captured parenthesized subpattern, and so on.
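
    Put together, the first pattern could be applied to the question's HTML roughly like this. The sample is abbreviated, and the second preg_match_all call that splits out the <li> items is an assumed follow-up step, not something the question asked for.

```php
<?php
// Sample input, abbreviated from the HTML in the question.
$html = <<<'HTML'
<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
<li>
    <div class="c_c d-d">cccc cccc ccccc</div>
</li>
</ul>
<table><tr><td>sdsdf</td></tr></table>
HTML;

$pattern = '~<ul>(.*?)</ul>~s';
$count = preg_match_all($pattern, $html, $out);

// $out[0][0] is the whole match including the <ul> and </ul> tags,
// $out[1][0] is only what the parentheses captured (the list body).
echo $count, " match(es)\n";
echo htmlspecialchars(trim($out[1][0])), "\n";

// Assumed follow-up step: pull the <li> bodies out of the captured part.
preg_match_all('~<li>(.*?)</li>~s', $out[1][0], $items);
echo count($items[1]), " list item(s)\n";
```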


    If the tag you are searching for can contain attributes (e.g. <ul class="my_list"...), this extended pattern adds [^>]* after <ul, which matches any number of characters that are not > before the closing > of the tag:

    $pattern = '~<ul[^>]*>\K.*(?=</ul>)~Uis';
    

    Instead of the question mark, here I use the U modifier to make all quantifiers lazy. To get only the desired part, i.e. what lies between <ul> and </ul>, \K is used to reset the beginning of the reported match. Instead of matching the ending </ul>, a lookahead (?=...) is used, as we don't want that part in the output either.

    This is basically the same as '~<ul[^>]*>(.*)</ul>~Uis' which would capture whole-pattern matches to [0] and first parenthesized group to [1].
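
    A small sketch comparing the two variants side by side. The input string with attributes and uppercase tag names is a made-up example.

```php
<?php
// Hypothetical input: a list that carries attributes and uppercase tag names.
$html = '<UL class="my_list" id="x"><li>a</li></UL><p>text</p><ul><li>b</li></ul>';

// Variant 1: \K drops the opening tag from the reported match,
// and the lookahead checks for </ul> without consuming it.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $viaK);

// Variant 2: a plain capturing group; the inner part lands in [1].
preg_match_all('~<ul[^>]*>(.*)</ul>~Uis', $html, $viaGroup);

print_r($viaK[0]);      // inner parts only
print_r($viaGroup[1]);  // the same inner parts
print_r($viaGroup[0]);  // whole matches, tags included
```

    Which variant to use is mostly a matter of taste: with \K the result is already in [0], while the capturing group keeps the full tag available in [0] as well.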


    But if your HTML contains nested tags of the same kind, the idea of the following pattern is to catch the innermost ones. At each character inside <ul>...</ul> it checks that no opening <ul starts at that position:

    $pattern = '~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis';
    

    Get matches using preg_match_all

    $html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div>
             <ul><li>.2.</li></ul>';
    
    if(preg_match_all($pattern, $html, $out))
    {
      echo "<pre>"; print_r(array_map('htmlspecialchars',$out[0])); echo "</pre>";
    } else {
    
      echo "FAIL";
    }
    

    The text matched between \K and (?= is reported in $out[0].

    • \K resets the beginning of the reported match (supported in PHP since 5.2.4).
    • In the nested-tag pattern, once <ul> has matched, the negative lookahead (?!<ul) checks at each character that no opening <ul starts there. If one does, the attempt fails and matching starts over at the next <ul, until </ul> is ahead (?=</ul>).
    • [^>]* matches any number of characters that are not > (negated character class).
    • (?: starts a non-capturing group.
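
    To see what the (?!<ul) check changes, the following sketch runs the pattern with and without it on a nested input. The string has the same structure as the test string above, written on one line.

```php
<?php
// Same nested structure as the answer's test string, on one line for brevity.
$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div><ul><li>.2.</li></ul>';

// Without the (?!<ul) check: the match starts at the OUTER <ul> and stops
// at the first </ul>, so it cuts across the nesting.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $plain);

// With the check: a match cannot run over another opening <ul, so the
// attempt at the outer <ul> fails and only innermost lists are reported.
preg_match_all('~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis', $html, $innermost);

print_r($plain[0]);
print_r($innermost[0]);
```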

    Modifiers used: Uis (the part after the ending delimiter ~)

    U (PCRE_UNGREEDY), i (PCRE_CASELESS), s (PCRE_DOTALL)
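
    The effect of each modifier can be checked in isolation with a short sketch. The input is a made-up string containing one uppercase multi-line list and one lowercase single-line list.

```php
<?php
$html = "<UL>\n<li>x</li>\n</UL><ul>y</ul>";

// No modifiers: <ul> is case-sensitive and the dot stops at newlines,
// so only the lowercase single-line list matches.
preg_match_all('~<ul>(.*?)</ul>~', $html, $none);

// i: tag names match in any case; s: the dot also matches "\n".
preg_match_all('~<ul>(.*?)</ul>~is', $html, $is);

// U inverts greediness: a bare .* becomes lazy ...
preg_match_all('~<ul>(.*)</ul>~Uis', $html, $u);
// ... and .*? becomes greedy again, swallowing both lists in one match.
preg_match_all('~<ul>(.*?)</ul>~Uis', $html, $uq);

echo count($none[1]), ' ', count($is[1]), ' ', count($u[1]), ' ', count($uq[1]), "\n";
```

    Because U flips the meaning of ?, it is easy to misread a pattern that combines both, so it may be clearer to pick either the U modifier or explicit ? quantifiers and stick with one.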

    This answer was selected by the asker as the best answer.