dongya2578 2014-01-07 06:42

How do I extract certain HTML tags, e.g. <ul>, using a regex with preg_match_all in PHP?

I am new to regular expressions. I want to fetch some data from a web page source. I used file_get_contents("url") to get the page's HTML source. Now I want to capture a portion within some special tags.

I found that preg_match_all() works for this. I would like help solving my problem and, if possible, learning how to solve similar problems.

In the example below, how would I get the data within the <ul>? (I hope this sample HTML is easy enough to understand.)

<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
<li>
    <div class="b_b">bbbbb <span class="s-s">bbbb</span> bbbb</div>
</li>
<li>
    <div class="c_c d-d">cccc cccc ccccc</div>
</li>
</ul>
<table>
<tr>
    <td>sdsdf</td>
    <td>hjhjhj</td>
</tr>
<tr>
    <td>yuyuy</td>
    <td>ertre</td>
</tr>   
</table>

2 answers

  • douchilian1009 2014-01-07 09:38

    As the comments already stated, parsing HTML with regex is generally not recommended. In my opinion, it depends on what exactly you are trying to do.


    If you want to use regex and know that there are no nested tags of the same kind, the simplest pattern for getting everything between <ul> and the closest </ul> would be:

    $pattern = '~<ul>(.*?)</ul>~s';
    

    It matches <ul>, followed by as few characters of any kind as possible until </ul> is reached. The dot is a metacharacter that matches any single character except a newline (\n). To make it match newlines too, I put the s-modifier after the ending delimiter ~. The quantifier * means zero or more times.

    By default, quantifiers are greedy, which means they consume as much as possible. A question mark ? after the * makes it non-greedy (or lazy), so it matches as few characters as possible before reaching </ul>. As the pattern delimiter I chose the tilde ~.
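To make the difference concrete, here is a minimal sketch comparing the greedy and lazy forms, and showing what happens without the s-modifier. The input string and the variable names ($greedy, $lazy, $withoutS) are made up for illustration and are not part of the question.

```php
<?php
// Hypothetical sample input: two separate lists spread over several lines.
$html = "<ul>\n<li>one</li>\n</ul>\n<p>between</p>\n<ul>\n<li>two</li>\n</ul>";

// Greedy: .* runs to the LAST </ul>, so both lists end up in a single match.
preg_match_all('~<ul>(.*)</ul>~s', $html, $greedy);

// Lazy: .*? stops at the CLOSEST </ul>, giving one match per list.
preg_match_all('~<ul>(.*?)</ul>~s', $html, $lazy);

// Without the s-modifier the dot cannot cross "\n", so nothing matches here.
$withoutS = preg_match_all('~<ul>(.*?)</ul>~', $html, $none);

echo count($greedy[1]), "\n"; // 1
echo count($lazy[1]), "\n";   // 2
echo $withoutS, "\n";         // 0
```

On input like this, the greedy version silently swallows the <p> between the two lists, which is usually not what you want when scraping.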

    preg_match_all($pattern, $html, $out);
    

    Matches are stored in the output variable that you pass to preg_match or preg_match_all, where [0] contains everything that matches the whole pattern, [1] the first captured parenthesized subpattern, and so on.
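As a rough illustration of that layout, the sketch below runs the simple pattern against a shortened copy of the question's markup. The shortening and the $count variable are my own choices for the example.

```php
<?php
// Shortened version of the question's sample markup.
$html = <<<HTML
<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
</ul>
<table>
<tr><td>sdsdf</td></tr>
</table>
HTML;

$pattern = '~<ul>(.*?)</ul>~s';
$count = preg_match_all($pattern, $html, $out);

// $out[0] holds the full matches, including the <ul> and </ul> tags.
// $out[1] holds only what the first parenthesized group captured.
echo $count, "\n";
echo $out[0][0], "\n";
echo trim($out[1][0]), "\n";
```

If you prefer the results grouped per match rather than per group, preg_match_all also accepts PREG_SET_ORDER as a fourth argument.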


    If the tag you are searching for can contain attributes (e.g. <ul class="my_list"...), this extended pattern adds [^>]* after <ul, which matches any number of characters that are not > before the closing >:

    $pattern = '~<ul[^>]*>\K.*(?=</ul>)~Uis';
    

    Instead of the question mark, here I use the U-modifier to make all quantifiers lazy. To report only the desired part, i.e. what lies between <ul> and </ul>, \K is used to reset the beginning of the reported match. Instead of matching the ending </ul>, a lookahead (?=</ul>) is used, as we do not want that part in the output either.

    This is basically the same as '~<ul[^>]*>(.*)</ul>~Uis', which would capture whole-pattern matches to [0] and the first parenthesized group to [1].
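The two variants can be put side by side in a small sketch. The input string here is hypothetical (an upper-case tag with an attribute, to exercise both [^>]* and the i-modifier), and $a and $b are just illustrative names.

```php
<?php
// Hypothetical input: a list whose opening tag carries an attribute.
$html = '<p>intro</p><UL class="my_list"><li>x</li></UL><p>outro</p>';

// Variant 1: \K plus a lookahead, so the full match is only the inner part.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $a);

// Variant 2: a capture group; the inner part is in [1], the tags stay in [0].
preg_match_all('~<ul[^>]*>(.*)</ul>~Uis', $html, $b);

echo $a[0][0], "\n"; // <li>x</li>
echo $b[1][0], "\n"; // <li>x</li>
echo $b[0][0], "\n"; // <UL class="my_list"><li>x</li></UL>
```

Which one to use is mostly a matter of taste: the \K form gives a flat result array, while the capture-group form keeps the surrounding tags available in [0] if you need them.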


    But if your HTML contains nested tags of the same kind, the idea of the following pattern is to catch only the innermost ones. At each character inside <ul>...</ul> it checks that no opening <ul starts there:

    $pattern = '~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis';
    

    Get the matches using preg_match_all:

    $html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div>
             <ul><li>.2.</li></ul>';
    
    if (preg_match_all($pattern, $html, $out)) {
      // Escape the matches so the tags are displayed as text in the browser
      echo "<pre>"; print_r(array_map('htmlspecialchars', $out[0])); echo "</pre>";
    } else {
      echo "FAIL";
    }
    

    The text matched between \K and (?= is reported in $out[0].

    • \K resets the beginning of the reported match (supported in PHP since 5.2.4)
    • the last pattern, once <ul> has matched, uses the negative lookahead (?!<ul) at each character to check that no opening <ul begins there before </ul> is reached; if one does, the attempt fails and matching starts over further on, until </ul> is ahead (?=</ul>).
    • [^>]* matches any number of characters that are not > (negated character class)
    • (?: starts a non-capturing group.
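To see what the negative lookahead buys you, this sketch runs the attribute-aware pattern with and without the (?!<ul) guard on the nested sample from above (written on one line here for simplicity). The names $plain and $inner are only for illustration.

```php
<?php
// The nested sample from above, written on one line.
$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div><ul><li>.2.</li></ul>';

// Plain lazy pattern: starts at the outer <ul> and stops at the first </ul>,
// so the first reported match is a broken fragment.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $plain);

// Guarded pattern: (?!<ul) forbids stepping over another opening <ul,
// so only the innermost lists are reported.
preg_match_all('~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis', $html, $inner);

print_r($plain[0]); // '<li><ul><li>.1.</li>' and '<li>.2.</li>'
print_r($inner[0]); // '<li>.1.</li>' and '<li>.2.</li>'
```

Note that neither pattern returns the outer list as a whole; getting balanced outer lists with nested children is where a real HTML parser becomes the more comfortable tool.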

    Modifiers used: Uis (the part after the ending delimiter ~)

    U (PCRE_UNGREEDY), i (PCRE_CASELESS), s (PCRE_DOTALL)
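A short sketch of how the modifiers interact may help. One thing that is easy to miss is that U inverts greediness, so under U a quantifier followed by ? becomes greedy again. The input and variable names below are invented for the example.

```php
<?php
// Hypothetical input with two lists and mixed-case tags.
$html = "<UL><li>1</li></UL>\n<ul><li>2</li></ul>";

// U makes .* lazy: one match per list.
$lazy = preg_match_all('~<ul>.*</ul>~Uis', $html, $m1);

// Under U, a trailing ? flips the quantifier back to greedy: one big match.
$greedy = preg_match_all('~<ul>.*?</ul>~Uis', $html, $m2);

// Without i, the upper-case <UL> list is skipped.
$caseSensitive = preg_match_all('~<ul>.*</ul>~Us', $html, $m3);

echo $lazy, " ", $greedy, " ", $caseSensitive, "\n"; // 2 1 1
```

Because of that inversion, it is probably safest not to mix the U-modifier with explicit ? quantifiers in the same pattern.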

    This answer was selected by the asker as the best answer.