drvpv7995 2014-06-18 01:08 Acceptance rate: 100%
Accepted

PHP Regex preg_match_all: divs that do not have the same id

I have an HTML page like this:

<!DOCTYPE html>
    <html>
        ....
        <body>
            <div class="list-news fl pt10 ">
                Blue
            </div>
            <div class="list-news fl pt10 alternative">
                Yellow
            </div>
             <div class="list-news fl pt10 ">
                Red
            </div>
            <div class="list-news fl pt10 alternative">
                Cyan
            </div>
            <div class="list-news fl pt10 ">
                Black
            </div>
            <div class="list-news fl pt10 alternative">
                White
            </div>
        </body>
    </html>

I wrote a short piece of PHP code to get all the content I need:

preg_match_all('@<div class="list-news fl pt10 .*?">(.*?)<div class="list-news fl pt10 .*?">@s',$rs,$match);

This is the result:

[1] => Array
(
    [0] => <div>Blue</div></div>
    [1] => <div>Red</div></div>
    [2] => <div>Black</div></div>
)

The result only contains the content of the <div class="list-news fl pt10 "> elements, not the content of the <div class="list-news fl pt10 alternative"> elements. I could use str_replace to remove the alternative class, but without replacing that string, how can I get the content of every div whose class matches list-news fl pt10.*??

Thanks for any ideas.

1 answer

  • douzhuo2722 2014-06-18 01:33

    A DOM approach (with a naive contains):

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = <<<'EOD'
    //div[
        contains(@class, 'list-news') and
        contains(@class, 'fl') and
        contains(@class, 'pt10')]
    EOD;
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    
    }
    print_r($results);
    

    A regex approach (with a naive pattern):

    preg_match_all('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~',
                   $html, $matches);
    print_r($matches[0]);
    

    Both approaches are a little naive: contains() ignores word boundaries and the order of the classes (contains(@class, 'fl') also matches a class such as flow), and the regex pattern does not allow for the irregularities that real HTML can have.
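
    The word-boundary problem of contains() can be avoided inside XPath itself by padding the class list with spaces and matching each class as a whole token. The following is only a sketch: the sample markup is modelled on the question, and the "Decoy" div with its class names is a hypothetical addition to show what a plain contains() would wrongly accept.

    ```php
    <?php
    // Sample markup modelled on the question. The "Decoy" div is hypothetical:
    // a plain contains() accepts it because "fl" occurs inside "flow", etc.
    $html = '<div class="list-news fl pt10 ">Blue</div>'
          . '<div class="list-news fl pt10 alternative">Yellow</div>'
          . '<div class="not-list-news flow pt100">Decoy</div>';

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Pad the normalized class list with spaces so that each class
    // can only match as a complete, space-delimited token.
    $token = "contains(concat(' ', normalize-space(@class), ' '), ' %s ')";
    $tests = array();
    foreach (array('list-news', 'fl', 'pt10') as $class) {
        $tests[] = sprintf($token, $class);
    }
    $query = '//div[' . implode(' and ', $tests) . ']';

    $results = array();
    foreach ($xpath->query($query) as $node) {
        $results[] = trim($node->textContent);
    }
    print_r($results);
    ```

    For this sample the result holds Blue and Yellow, and the decoy is skipped. The class order still does not matter, which is usually what you want for CSS classes.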

    The reason your pattern doesn't work is that preg_match_all cannot return overlapping matches. Since the first match ends with <div class="list-news..., the next match cannot begin with that same <div class="list-news..., which has already been consumed.

    Putting the last <div class="list-news... in a lookahead (?=...), which is only a check and whose content is not part of the match result, is one way around this. However, it is simpler to use the closing tag </div>.
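
    To make the difference concrete, here is a small comparison of the consuming pattern and a lookahead variant. It is a sketch under assumptions: the markup is a flattened version of the question's page, and the `|\z` alternative is my addition so that the last div, which has no following opening tag, is also matched.

    ```php
    <?php
    // Flattened sample modelled on the question's page (one div per item).
    $html = '<div class="list-news fl pt10 ">Blue</div>'
          . '<div class="list-news fl pt10 alternative">Yellow</div>'
          . '<div class="list-news fl pt10 ">Red</div>'
          . '<div class="list-news fl pt10 alternative">Cyan</div>';

    // Opening tag with any (possibly empty) trailing class text.
    $open = '<div class="list-news fl pt10 [^"]*">';

    // Consuming version: the next opening tag is part of the match,
    // so that div can no longer start a match of its own.
    preg_match_all('@' . $open . '(.*?)' . $open . '@s', $html, $consumed);

    // Lookahead version: the next opening tag (or the end of the input)
    // is only checked, so the following match can start right there.
    preg_match_all('@' . $open . '(.*?)(?=' . $open . '|\z)@s', $html, $peeked);

    print_r($consumed[1]); // two entries: every second div is skipped
    print_r($peeked[1]);   // four entries: one per div
    ```

    Each captured string still ends with the closing </div>, which is one more reason to anchor on the closing tag instead.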

    \K removes everything matched before it (on its left) from the match result.
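
    A short side-by-side may help to see what \K changes. This sketch uses a two-div sample of my own (with extra spaces around Blue) and compares a capture group with the \K form of the same pattern; the s modifier is added here so that the dot would also cross line breaks.

    ```php
    <?php
    // Small sample of my own; the padding around "Blue" shows the trimming.
    $html = '<div class="list-news fl pt10 ">  Blue  </div>'
          . '<div class="list-news fl pt10 alternative">Yellow</div>';

    // Capture group: the text is in index 1, index 0 still holds the whole tag.
    preg_match_all('~<div class="list-news fl pt10\b[^>]+>\s*(.*?)\s*</div>~s',
                   $html, $grouped);

    // \K: the tag and leading spaces are matched but dropped from the result,
    // and the lookahead keeps the closing tag out, so index 0 is the text.
    preg_match_all('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~s',
                   $html, $kept);

    print_r($grouped[1]);
    print_r($kept[0]);
    ```

    Both arrays contain Blue and Yellow. The \K form simply saves the capture group.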

    A good compromise is to extract all the div tags that have a class attribute, then check with a regex that the attribute value is really what you want before extracting and trimming the text content:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = '//div[@class]';
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
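    foreach ($nodes as $node) {
        // Keep the div only if the three classes appear together as whole words.
        if ( preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                        $node->getAttribute('class')) ) {
            // Append the trimmed text to the result array.
            $results[] = trim($node->textContent);
        }
    }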
    

    or without XPath:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $divs = $dom->getElementsByTagName('div');
    
    $results = array();
    
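    foreach ($divs as $node) {
        // Keep the div only if it has a class attribute containing the three classes.
        if ( $node->hasAttribute('class') &&
             preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                        $node->getAttribute('class')) ) {
            // Append the trimmed text to the result array.
            $results[] = trim($node->textContent);
        }
    }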
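
    If the three classes are not guaranteed to be adjacent and in that order, the attribute can be split into tokens instead of being tested with one regex. This is a sketch only: hasAllClasses is a hypothetical helper name, and the sample markup is mine.

    ```php
    <?php
    // Hypothetical helper: true when every class in $wanted is set on $node.
    function hasAllClasses(DOMElement $node, array $wanted)
    {
        // Split the attribute on whitespace; a missing attribute gives no tokens.
        $present = preg_split('~\s+~', $node->getAttribute('class'),
                              -1, PREG_SPLIT_NO_EMPTY);
        return count(array_diff($wanted, $present)) === 0;
    }

    // Sample of my own: shuffled classes in the first div, "flow" in the second.
    $html = '<div class="pt10 list-news extra fl">Blue</div>'
          . '<div class="list-news flow pt10">Decoy</div>';

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $results = array();
    foreach ($dom->getElementsByTagName('div') as $node) {
        if (hasAllClasses($node, array('list-news', 'fl', 'pt10'))) {
            $results[] = trim($node->textContent);
        }
    }
    print_r($results);
    ```

    Only Blue is kept here: the first div carries all three classes in a different order, and the second one lacks fl as a separate class.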
    