drvpv7995 2014-06-18 01:08

PHP regex preg_match_all: divs that do not share the same class

I have an HTML page like this:

<!DOCTYPE html>
    <html>
        ....
        <body>
            <div class="list-news fl pt10 ">
                Blue
            </div>
            <div class="list-news fl pt10 alternative">
                Yellow
            </div>
            <div class="list-news fl pt10 ">
                Red
            </div>
            <div class="list-news fl pt10 alternative">
                Cyan
            </div>
            <div class="list-news fl pt10 ">
                Black
            </div>
            <div class="list-news fl pt10 alternative">
                White
            </div>
        </body>
    </html>

I wrote a short piece of PHP code to get all the content I need:

preg_match_all('@<div class="list-news fl pt10 .*?">(.*?)<div class="list-news fl pt10 .*?">@s',$rs,$match);

This is the result:

[1] => Array
(
    [0] => <div>Blue</div></div>
    [1] => <div>Red</div></div>
    [2] => <div>Black</div></div>
)

The result only contains the content of the <div class="list-news fl pt10 "> elements, not the content of the <div class="list-news fl pt10 alternative"> elements. I could use str_replace to remove the alternative class first, but without replacing that string, how can I get the content of every div whose class matches list-news fl pt10.*?

Thanks for any ideas.

1 answer

  • douzhuo2722 2014-06-18 01:33

    A DOM approach (with a naive contains):

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = <<<'EOD'
    //div[
        contains(@class, 'list-news') and
        contains(@class, 'fl') and
        contains(@class, 'pt10')]
    EOD;
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    print_r($results);
    

    A regex approach (with a naive pattern):

    preg_match_all('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~',
                   $html, $matches);
    print_r($matches[0]);
    

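    Both approaches are a little naive. contains() is a plain substring test, so it ignores word boundaries (it would also accept a class such as pt100) and the order of the classes. The regex pattern, for its part, doesn't cope with the irregularities real HTML can have, such as other attributes before class, single quotes, or nested divs.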
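To make the word-boundary problem concrete, here is a minimal sketch that compares the naive contains() query with the common XPath 1.0 idiom of padding the normalized class value with spaces, so that each class is tested as a whole token. The sample markup, the decoy div, and the helper collectText() are assumptions made up for this illustration; they are not part of the question.

```php
<?php
// Sample markup modeled on the question, plus a decoy whose class names
// only contain the wanted names as substrings (assumed for illustration).
$html = <<<'HTML'
<html><body>
<div class="list-news fl pt10 ">Blue</div>
<div class="list-news fl pt10 alternative">Yellow</div>
<div class="list-newsletter flat pt100">Decoy</div>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: run a query and return the trimmed text of each node.
function collectText(DOMXPath $xpath, $query)
{
    $texts = array();
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Naive version: plain substring tests on the class attribute.
$naiveQuery = "//div[contains(@class, 'list-news')"
            . " and contains(@class, 'fl')"
            . " and contains(@class, 'pt10')]";

// Whole-token version: pad the normalized class list with spaces and
// look for ' name ' instead of 'name'.
$tests = array();
foreach (array('list-news', 'fl', 'pt10') as $class) {
    $tests[] = "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}
$strictQuery = '//div[' . implode(' and ', $tests) . ']';

$naive  = collectText($xpath, $naiveQuery);   // Blue, Yellow, Decoy
$strict = collectText($xpath, $strictQuery);  // Blue, Yellow

print_r($naive);
print_r($strict);
```

The whole-token query still ignores the order of the classes, which is usually what is wanted for class matching.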

    The reason your pattern doesn't work is that preg_match_all can't return overlapping matches. Since the first match ends with <div class="list-news..., the next match can't begin with that same <div class="list-news..., because it has already been consumed. That is why every second div is skipped.

    Putting the last <div class="list-news... in a lookahead (?=...), which is only a check and whose content is not part of the match result, can be a way. However, it is simpler to use the closing tag </div>.
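As a rough illustration of the overlap problem and the lookahead fix, the sketch below runs the pattern from the question next to a variant whose trailing opening tag sits inside (?=...). The shortened sample markup, the |\z alternative (so the last div still matches), and the $clean closure are assumptions added for this example.

```php
<?php
// Shortened sample markup modeled on the question (assumed for illustration).
$html = <<<'HTML'
<div class="list-news fl pt10 ">Blue</div>
<div class="list-news fl pt10 alternative">Yellow</div>
<div class="list-news fl pt10 ">Red</div>
<div class="list-news fl pt10 alternative">Cyan</div>
HTML;

// Opening tag as written in the question's pattern.
$open = '<div class="list-news fl pt10 .*?">';

// Pattern from the question: the second opening tag is consumed,
// so the next match cannot start there.
preg_match_all('@' . $open . '(.*?)' . $open . '@s', $html, $consuming);

// Same idea with the second opening tag in a lookahead: it is checked but
// not consumed. \z lets the last div match at the end of the input.
preg_match_all('@' . $open . '(.*?)(?=' . $open . '|\z)@s', $html, $peeking);

// The captures still contain the closing tag and whitespace; strip them.
$clean = function ($s) {
    return trim(strip_tags($s));
};

$skipped = array_map($clean, $consuming[1]);  // Blue, Red
$all     = array_map($clean, $peeking[1]);    // Blue, Yellow, Red, Cyan

print_r($skipped);
print_r($all);
```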

    \K discards everything that has been matched so far (to its left) from the match result, so the whole match contains only the text content and no capture group is needed.
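A small sketch of what \K changes, assuming a single sample div invented for this example: the first pattern uses a capture group, the second is the pattern from this answer. Both yield the same text, but with \K it is the whole match (index 0).

```php
<?php
// One sample div with padding around the text (assumed for illustration).
$html = '<div class="list-news fl pt10 alternative">  Yellow  </div>';

// Capture-group version: index 0 is the whole tag, index 1 is the text.
preg_match('~<div class="list-news fl pt10\b[^>]+>\s*(.*?)\s*</div>~',
           $html, $withGroup);

// \K version: everything matched before \K is dropped from the result,
// and the closing tag is only checked by the lookahead.
preg_match('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~',
           $html, $withK);

print_r($withGroup);  // [0] => the whole div, [1] => Yellow
print_r($withK);      // [0] => Yellow
```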

    A good compromise can be to extract all the div tags that have a class attribute, and then check with a regex whether the attribute value is really what you want before extracting and trimming the text content:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    $query = '//div[@class]';
    
    $nodes = $xpath->query($query);
    
    $results = array();
    
    foreach ($nodes as $node) {
        // Append with []; a plain assignment would overwrite the array on every match.
        if (preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                       $node->getAttribute('class'))) {
            $results[] = trim($node->textContent);
        }
    }
    

    or without XPath:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $divs = $dom->getElementsByTagName('div');
    
    $results = array();
    
    foreach ($divs as $node) {
        // Append with []; a plain assignment would overwrite the array on every match.
        if ($node->hasAttribute('class') &&
            preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
                       $node->getAttribute('class'))) {
            $results[] = trim($node->textContent);
        }
    }
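The regex check above expects the three classes to be adjacent and in that order. If that is too strict, one possible variant is to split the attribute value on whitespace and test each class as a token. This is only a sketch: the helper hasAllClasses() and the sample markup (including the reordered and decoy divs) are assumptions for illustration.

```php
<?php
// Hypothetical helper: true when every wanted class appears as a whole
// token in the class attribute value, in any order.
function hasAllClasses($classAttr, array $wanted)
{
    $tokens = preg_split('~\s+~', $classAttr, -1, PREG_SPLIT_NO_EMPTY);
    return count(array_diff($wanted, $tokens)) === 0;
}

// Sample markup (assumed): a regular div, one with reordered classes,
// a decoy with look-alike class names, and a div without a class.
$html = <<<'HTML'
<html><body>
<div class="list-news fl pt10 ">Blue</div>
<div class="pt10 alternative fl list-news">Cyan</div>
<div class="list-newsletter flat pt100">Decoy</div>
<div>No class</div>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);

$results = array();
foreach ($dom->getElementsByTagName('div') as $node) {
    // getAttribute() returns an empty string when the attribute is missing.
    if (hasAllClasses($node->getAttribute('class'),
                      array('list-news', 'fl', 'pt10'))) {
        $results[] = trim($node->textContent);
    }
}

print_r($results);  // Blue, Cyan
```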
    