drvpv7995 2014-06-18 01:08

PHP Regex preg_match_all: divs that do not share the same id

I have an HTML page like this:

<!DOCTYPE html>
    <html>
        ....
        <body>
            <div class="list-news fl pt10 ">
                Blue
            </div>
            <div class="list-news fl pt10 alternative">
                Yellow
            </div>
             <div class="list-news fl pt10 ">
                Red
            </div>
            <div class="list-news fl pt10 alternative">
                Cyan
            </div>
            <div class="list-news fl pt10 ">
                Black
            </div>
            <div class="list-news fl pt10 alternative">
                White
            </div>
        </body>
    </html>

I wrote a short piece of PHP code to get all the content I need:

preg_match_all('@<div class="list-news fl pt10 .*?">(.*?)<div class="list-news fl pt10 .*?">@s',$rs,$match);

This is the result:

[1] => Array
(
    [0] => <div>Blue</div></div>
    [1] => <div>Red</div></div>
    [2] => <div>Black</div></div>
)

The result only contains the content of the <div class="list-news fl pt10 "> elements, not the content of the <div class="list-news fl pt10 alternative"> elements. I could use str_replace to remove the alternative class first, but without replacing that string, how can I get the content of every div whose class matches list-news fl pt10.*?

Thanks for any ideas.

1 answer

  • douzhuo2722 2014-06-18 01:33

    A DOM approach (with a naive contains):

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = <<<'EOD'
    //div[
        contains(@class, 'list-news') and
        contains(@class, 'fl') and
        contains(@class, 'pt10')]
    EOD;
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    
    }
    print_r($results);
    

    A regex approach (with a naive pattern):

    preg_match_all('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~',
                   $html, $matches);
    print_r($matches[0]);
    

    Both approaches are a little naive: contains() only tests for substrings, so it ignores word boundaries and the order of the classes, and the regex pattern does not cope with the irregularities that real HTML can have.
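
    To make the first caveat concrete, here is a small sketch. The sample markup (including the decoy class names list-news-old, flat and pt100) and the texts() helper are made up for illustration. The concat(' ', normalize-space(@class), ' ') idiom is a common XPath 1.0 workaround for whole-token matching; it is offered as one possible tightening, not as part of the approach above.

```php
<?php
// Hypothetical sample: the second div only *looks* like it carries the target classes.
$html = '<html><body>'
      . '<div class="list-news fl pt10 alternative">Yellow</div>'
      . '<div class="list-news-old flat pt100">Decoy</div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Naive test: plain substring checks, so the decoy div matches as well.
$naive = "//div[contains(@class, 'list-news') and contains(@class, 'fl') and contains(@class, 'pt10')]";

// Stricter test: pad the normalized attribute with spaces and look for whole tokens.
$padded = "concat(' ', normalize-space(@class), ' ')";
$strict = "//div[contains($padded, ' list-news ') and contains($padded, ' fl ') and contains($padded, ' pt10 ')]";

// Hypothetical helper: collect the trimmed text of every node returned by a query.
function texts(DOMXPath $xpath, $query) {
    $out = array();
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

$naiveTexts  = texts($xpath, $naive);   // Yellow and Decoy
$strictTexts = texts($xpath, $strict);  // Yellow only
print_r($naiveTexts);
print_r($strictTexts);
```

    The stricter query still accepts the classes in any order, which may or may not be what you want.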

    The reason your pattern doesn't work is that preg_match_all cannot return overlapping matches. Since the first match ends with <div class="list-news..., the next match can't begin with that same <div class="list-news..., because it has already been consumed.

    Putting the last <div class="list-news... in a lookahead (?=...), which is only a check and whose content is not part of the match result, can be a way. However, it is simpler to use the closing tag </div>.
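
    As a rough illustration of the two paragraphs above, the sketch below runs a consuming pattern and a lookahead pattern over a shortened, single-line version of the question's markup. The variable names, the [^"]* class tail and the \z alternative (needed so that the last div, which has no following opening tag, still matches) are assumptions added for the example, not part of the original pattern.

```php
<?php
// Shortened sample based on the markup in the question.
$html = '<div class="list-news fl pt10 ">Blue</div>'
      . '<div class="list-news fl pt10 alternative">Yellow</div>'
      . '<div class="list-news fl pt10 ">Red</div>'
      . '<div class="list-news fl pt10 alternative">Cyan</div>';

// Opening tag of a target div, with or without extra classes.
$open = '<div class="list-news fl pt10 [^"]*">';

// Consuming version: the next opening tag becomes part of the match,
// so the search resumes after it and every second div is skipped.
preg_match_all('@' . $open . '(.*?)</div>\s*' . $open . '@s', $html, $consumed);

// Lookahead version: the next opening tag is only checked, not consumed,
// so it is still available as the start of the next match.
preg_match_all('@' . $open . '(.*?)</div>\s*(?=' . $open . '|\z)@s', $html, $peeked);

print_r($consumed[1]); // Blue, Red
print_r($peeked[1]);   // Blue, Yellow, Red, Cyan
```

    Having to add an end-of-string alternative for the last div is one reason why matching up to the closing tag is less fiddly.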

    \K discards everything that has been matched so far (to its left) from the match result, so the whole match contains only what follows it.
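
    A small comparison may help here. The sketch below applies the pattern from the regex approach twice to a single sample div (the sample string and the variable names are assumptions for the example): once with a capture group in place of \K, and once with \K as written above.

```php
<?php
// One div from the question, kept on several lines as in the original page.
$html = '<div class="list-news fl pt10 alternative">
    Yellow
</div>';

// Capture group: the text lands in group 1, and the whole match (index 0)
// still starts with the opening tag.
preg_match('~<div class="list-news fl pt10\b[^>]+>\s*(.*?)(?=\s*</div>)~', $html, $withGroup);

// \K: everything matched before it is dropped from the whole match,
// so index 0 is already the text and no group is needed.
preg_match('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~', $html, $withK);

echo $withGroup[1], "\n"; // Yellow
echo $withK[0], "\n";     // Yellow
```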

    A good compromise can be to extract all the div tags that have a class attribute, and then to check with a regex that the attribute value is really what you want before extracting and trimming the text content:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = '//div[@class]';
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
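    foreach ($nodes as $node) {
        // Keep the node only if its class attribute holds the three classes, in this order.
        if (preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                       $node->getAttribute('class'))) {
            // Append to the array; a plain assignment would overwrite it on every match.
            $results[] = trim($node->textContent);
        }
    }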
    

    Or, without XPath:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $divs = $dom->getElementsByTagName('div');
    
    $results = array();
    
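    foreach ($divs as $node) {
        // Skip divs without a class attribute, then apply the same regex check.
        if ($node->hasAttribute('class') &&
            preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                       $node->getAttribute('class'))) {
            // Append to the array; a plain assignment would overwrite it on every match.
            $results[] = trim($node->textContent);
        }
    }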
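
    The regex check above expects the three classes next to each other and in that order. If the order may vary, one possible variation is to split the attribute into tokens and compare sets. This is only a sketch: hasAllClasses() is a hypothetical helper, and the sample markup is made up for the example.

```php
<?php
// Hypothetical helper: true when every required class appears as a whole token,
// whatever the order and whatever other classes are present.
function hasAllClasses($classAttr, array $required) {
    $tokens = preg_split('~\s+~', trim($classAttr), -1, PREG_SPLIT_NO_EMPTY);
    return count(array_diff($required, $tokens)) === 0;
}

// Made-up sample: shuffled classes on the first div, look-alike classes on the second.
$html = '<html><body>'
      . '<div class="pt10 alternative fl list-news">White</div>'
      . '<div class="list-news-old flat pt100">Decoy</div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$results = array();
foreach ($dom->getElementsByTagName('div') as $node) {
    // getAttribute() returns an empty string when the attribute is missing.
    if (hasAllClasses($node->getAttribute('class'), array('list-news', 'fl', 'pt10'))) {
        $results[] = trim($node->textContent);
    }
}
print_r($results); // White only
```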
    
    This answer was selected as the best answer by the asker.