drvpv7995 2014-06-18 01:08 Acceptance rate: 100%
Views: 49
Accepted

PHP regex preg_match_all: divs that do not all share the same class

I have an HTML page like this:

<!DOCTYPE html>
    <html>
        ....
        <body>
            <div class="list-news fl pt10 ">
                Blue
            </div>
            <div class="list-news fl pt10 alternative">
                Yellow
            </div>
            <div class="list-news fl pt10 ">
                Red
            </div>
            <div class="list-news fl pt10 alternative">
                Cyan
            </div>
            <div class="list-news fl pt10 ">
                Black
            </div>
            <div class="list-news fl pt10 alternative">
                White
            </div>
        </body>
    </html>

I wrote a short piece of PHP code to get all the content I need:

preg_match_all('@<div class="list-news fl pt10 .*?">(.*?)<div class="list-news fl pt10 .*?">@s',$rs,$match);

This is the result:

[1] => Array
(
    [0] => <div>Blue</div></div>
    [1] => <div>Red</div></div>
    [2] => <div>Black</div></div>
)

The result only contains the content of the <div class="list-news fl pt10 "> elements, not the content of the <div class="list-news fl pt10 alternative"> elements. I could use str_replace to remove the alternative class first, but without replacing that string, how can I get the content of every div whose class matches list-news fl pt10.*? ?

Thanks for any ideas.

1 answer

  • douzhuo2722 2014-06-18 01:33

    A DOM approach (with a naive contains):

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = <<<'EOD'
    //div[
        contains(@class, 'list-news') and
        contains(@class, 'fl') and
        contains(@class, 'pt10')]
    EOD;
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    print_r($results);
    

    A regex approach (with a naive pattern):

    preg_match_all('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~',
                   $html, $matches);
    print_r($matches[0]);
    

    Both approaches are a little naive: contains doesn't care about word boundaries (contains(@class, 'fl') also matches a class such as flx) or about the order of the classes, and the regex pattern doesn't account for the possible irregularities of HTML code.
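
    One common way to make the XPath test respect word boundaries is to pad the normalized class value with spaces and look for each class name surrounded by spaces. The following is only a sketch under assumptions: the sample markup (including the decoy classes list-news-old and flx) is made up for illustration and is not the asker's page.

    ```php
    <?php
    // Hypothetical sample markup; the third div is a decoy that the naive
    // contains() test would also match.
    $html = '<html><body>'
          . '<div class="list-news fl pt10 ">Blue</div>'
          . '<div class="pt10 alternative fl list-news">Yellow</div>'
          . '<div class="list-news-old flx pt10">Decoy</div>'
          . '</body></html>';

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Build one whole-token test per class name: the class value is
    // whitespace-normalized and padded with spaces on both sides.
    $tests = array();
    foreach (array('list-news', 'fl', 'pt10') as $class) {
        $tests[] = "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    }
    $query = '//div[' . implode(' and ', $tests) . ']';

    $results = array();
    foreach ($xpath->query($query) as $node) {
        $results[] = trim($node->textContent);
    }
    print_r($results); // Blue and Yellow; the decoy div is skipped
    ```

    Because each class is tested separately, the order of the classes in the attribute no longer matters either.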

    The reason your pattern doesn't work is that preg_match_all can't return overlapping matches. Since the first match ends with <div class="list-news..., the next match can't begin with that same <div class="list-news..., which has already been consumed.

    Putting the last <div class="list-news... in a lookahead (?=...) (which is only a check, and whose content is not part of the match result) is one way around this. However, it is simpler to use the closing tag </div>.
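
    To illustrate the difference, here is a small sketch that runs the original pattern and a lookahead variant on the same input. The compact sample markup and the extra |</body> alternative (needed so that the last div, which has no following div, still matches) are assumptions added for this example.

    ```php
    <?php
    // Hypothetical, compacted version of the page from the question.
    $html = '<body>'
          . '<div class="list-news fl pt10 ">Blue</div>'
          . '<div class="list-news fl pt10 alternative">Yellow</div>'
          . '<div class="list-news fl pt10 ">Red</div>'
          . '<div class="list-news fl pt10 alternative">Cyan</div>'
          . '</body>';

    // Original pattern: the trailing opening tag is consumed by each match,
    // so every second div can never start a match of its own.
    preg_match_all(
        '@<div class="list-news fl pt10 .*?">(.*?)<div class="list-news fl pt10 .*?">@s',
        $html, $consuming);

    // Lookahead variant: the next opening tag is only checked, not consumed.
    preg_match_all(
        '@<div class="list-news fl pt10 .*?">(.*?)(?=<div class="list-news fl pt10 |</body>)@s',
        $html, $lookahead);

    print_r($consuming[1]); // 2 captures: Blue</div>, Red</div>
    print_r($lookahead[1]); // 4 captures, one per div
    ```

    Note that the captures still include the closing </div>, which is one reason why anchoring on the closing tag is the simpler option.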

    \K removes everything that has been matched before it (on its left) from the match result.
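
    The effect of \K can be seen by running the same pattern with and without it. This is a minimal sketch; the one-line subject string and the variable names are made up for the example.

    ```php
    <?php
    // Hypothetical one-line subject.
    $subject = '<div class="list-news fl pt10 alternative">  Yellow  </div>';

    // Without \K the opening tag and leading whitespace are part of the match.
    preg_match('~<div[^>]*>\s*.*?(?=\s*</div>)~', $subject, $with_tag);

    // With \K everything matched so far is dropped from the reported match.
    preg_match('~<div[^>]*>\s*\K.*?(?=\s*</div>)~', $subject, $text_only);

    var_dump($with_tag[0]);  // the opening tag followed by "  Yellow"
    var_dump($text_only[0]); // "Yellow"
    ```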

    A good compromise can be to extract all the div tags that have a class attribute, and then to check with a regex whether the attribute value is really what you want before extracting and trimming the text content:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = '//div[@class]';
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
    foreach($nodes as $node) {
        if ( preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                        $node->getAttribute('class')) )
            $results[] = trim($node->textContent);
    }
    

    or without XPath:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $divs = $dom->getElementsByTagName('div');
    
    $results = array();
    
    foreach($divs as $node) {
        if ( $node->hasAttribute('class') &&
             preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                        $node->getAttribute('class')) )
            $results[] = trim($node->textContent);
    }
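
    The regex check above still expects the three classes next to each other and in that order. If the order can vary, one possible alternative is to split the attribute into tokens and compare sets. This is only a sketch: the helper name hasAllClasses and the sample markup are hypothetical.

    ```php
    <?php
    // Hypothetical helper: true when every wanted class appears as a whole
    // token of the element's class attribute, in any order.
    function hasAllClasses(DOMElement $node, array $wanted)
    {
        $tokens = preg_split('~\s+~', $node->getAttribute('class'),
                             -1, PREG_SPLIT_NO_EMPTY);
        return count(array_diff($wanted, $tokens)) === 0;
    }

    // Hypothetical sample markup; the third div is a decoy.
    $html = '<html><body>'
          . '<div class="list-news fl pt10 ">Blue</div>'
          . '<div class="pt10 alternative fl list-news">Yellow</div>'
          . '<div class="list-news-old flx pt10">Decoy</div>'
          . '</body></html>';

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $results = array();
    foreach ($dom->getElementsByTagName('div') as $node) {
        if (hasAllClasses($node, array('list-news', 'fl', 'pt10'))) {
            $results[] = trim($node->textContent);
        }
    }
    print_r($results); // Blue and Yellow
    ```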
    
