dti70601 2013-07-26 18:06

PHP regex: match anything between two specific words/tags, with a condition

I'm not good with regex; here is my scenario.

I'm trying to extract some information from a web page that contains several tables. Only some of the tables contain a unique URL (let's say "very/unique.key"), so the page looks like this:

<table ....>
   (bunch of content)
</table>

<table ....>
   (bunch of content)
</table>

<table ....>
   (bunch of content + "very/unique.key" keyword)
</table>

<table ....>
   (bunch of content)
</table>

<table ....>
   (bunch of content + "very/unique.key" keyword)
</table>

So what I want is to extract the content of every table that contains the "very/unique.key" keyword. Here are the patterns I have tried:

$pattern = "#<table[^>]+>((?!\<table)(?=very\/unique\.key).*)<\/table>#i";

This returns nothing.

$pattern = "#<table[^>]+>((?!<table).*)<\/table>#i";

This returns everything from the first table's opening tag <table...> to the last table's closing tag </table>, even with the (?!<table) condition.

I'd appreciate any help with this. Thanks.

--EDIT--

Here is the solution that I found using DOM to loop through every table

--My Solution--

    $index = array(); // indexes of all the table(s) that contain the keyword

    $DOM = new DOMDocument();
    $DOM->loadHTMLFile("http://uni.corp/sub/sub/target.php?key=123");
    $xpath = new DOMXPath($DOM);
    $tables = $DOM->getElementsByTagName("table");
    for ($n = 0; $n < $tables->length; $n++) {
        $rows = $tables->item($n)->getElementsByTagName("tr");
        for ($i = 0; $i < $rows->length; $i++) {
            $cols = $rows->item($i)->getElementsByTagName("td");
            for ($j = 0; $j < $cols->length; $j++) {
                $td = $cols->item($j); // grab the td element
                $img = $xpath->query('./img', $td)->item(0); // first direct img child, or null if none

                if ($img !== null) {
                    $image = $img->getAttribute('src'); // grab the source of the image
                    // record each matching table once, in document order
                    if ($image == "very/unique.key" && !in_array($n, $index)) {
                        $index[] = $n;
                    }
                }
            }
        }
    }

    // loop that echoes only the table(s) that contain the keyword
    foreach ($index as $temp) {
        $rows = $tables->item($temp)->getElementsByTagName("tr");
        for ($i = 0; $i < $rows->length; $i++) {
            $cols = $rows->item($i)->getElementsByTagName("td");
            for ($j = 0; $j < $cols->length; $j++) {
                echo $cols->item($j)->nodeValue, "\t";
                // process the extracted table content here
            }
            //echo "<br/>";
        }
    }

Personally, though, I'm still curious about the regex part, and I hope someone can find a regex pattern that solves this. Anyway, thanks to everyone who helped or advised me on this (especially AbsoluteƵERØ).

2 answers

  • dqa35710 2013-07-26 18:44

    Though I agree with the comments on your post, I will give a regex solution. If you wanted to replace very/unique.key with something else, the regex would look something like this:

    #<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU
    

    The key here is to use the correct modifiers to make it work with your input string. For more information on these modifiers, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
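
As a rough illustration of what the modifiers contribute: `i` makes the match case-insensitive, `m` only changes how `^` and `$` behave (the pattern uses neither, so it has no visible effect here), `s` lets `.` match newlines, and `U` makes quantifiers lazy by default. The sketch below uses a made-up two-table string and simplified patterns, not the real page, to show the effect of `s` and `U`.

```php
<?php
// Hypothetical input, used only to illustrate the modifiers.
$html = "<table>\nrow one\n</table><table>\nrow two\n</table>";

// Without "s", "." stops at newlines, so a multi-line table never matches.
$withoutS = preg_match('#<table>(.*)</table>#', $html, $unused);

// With "s", "." also matches newlines; the default greedy ".*" runs to the LAST </table>.
preg_match('#<table>(.*)</table>#s', $html, $greedy);

// Adding "U" flips the default to lazy, so ".*" stops at the FIRST </table>.
preg_match('#<table>(.*)</table>#sU', $html, $lazy);

var_dump($withoutS);
echo $greedy[1], "\n---\n", $lazy[1], "\n";
```

This is why the pattern needs at least `s` for multi-line tables, and why `U` (or explicit `.*?`) keeps each capture from running to the end of the document.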

    Now here's an example where I replace very/unique.key with "foobar":

    <?php
    $string = "
    <table ....>
       (bunch of content)
    </table>
    
    <table ....>
       (bunch of content)
    </table>
    
    <table ....>
       bunch of content very/unique.key 
    </table>
    
    <table ....>
       (bunch of content)
    </table>
    
    <table ....>
       blabla very/unique.key
    </table>
    ";
    
    $pattern = '#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU';
    
    echo preg_replace($pattern, '<table$1>$3foobar$4</table>', $string);
    ?>
    

    This code prints exactly the same string but with the two "very/unique.key" replaced by "foobar", just like we want.
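
One caveat if the goal is extraction rather than replacement: the lazy `(.*)` can still start at an earlier `<table` and keep expanding across table boundaries until it reaches the key, so a single match may cover several tables. A possible workaround, sketched below, is to test at every character that no `<table` or `</table` begins there. This assumes the tables are not nested; the sample string, its `class` attributes, and the variable names are made up for the illustration.

```php
<?php
$html = '
<table class="a">
   (bunch of content)
</table>

<table class="b">
   bunch of content very/unique.key
</table>

<table class="c">
   (bunch of content)
</table>

<table class="d">
   blabla very/unique.key
</table>
';

// The lazy pattern from this answer: its first match starts at table "a".
$original = '#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU';
preg_match($original, $html, $first);
$spanned = substr_count($first[0], '<table'); // number of tables inside one match

// (?:(?!</?table\b).)* consumes one character at a time, but only while
// no <table or </table tag starts at that position.
$inside  = '(?:(?!</?table\b).)*';
$pattern = '#<table\b[^>]*>(' . $inside . 'very/unique\.key' . $inside . ')</table>#is';

preg_match_all($pattern, $html, $matches);

echo "tables spanned by the first lazy match: $spanned\n";
foreach ($matches[1] as $content) {
    echo trim($content), "\n";
}
```

With `preg_match_all`, `$matches[1]` holds the inner content of each table that contains the key, one entry per table.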

    Though this solution could work, it's certainly neither the most efficient nor the easiest to work with. As Mehdi said in the comments, PHP has an extension made specifically to operate on XML (and thus HTML).

    Here's a link to the documentation of that extension http://www.php.net/manual/en/intro.dom.php

    Using that, you could easily go through each table element and find the ones that contain the unique key.
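
A minimal sketch of that approach, assuming the key appears as the `src` of an `<img>` inside the table (as in the asker's own DOM solution) and that tables are not nested. The inline HTML is a stand-in for the real page; for a live URL, `loadHTMLFile()` could be used instead of `loadHTML()`.

```php
<?php
// Inline sample standing in for the real page.
$html = '<html><body>
<table><tr><td>plain</td></tr></table>
<table><tr><td><img src="very/unique.key">wanted 1</td></tr></table>
<table><tr><td>plain</td></tr></table>
<table><tr><td><img src="very/unique.key">wanted 2</td></tr></table>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings quietly
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select every <table> that has a descendant <img> whose src equals the key.
$tables = $xpath->query('//table[.//img[@src="very/unique.key"]]');

$found = array();
foreach ($tables as $table) {
    $found[] = trim($table->textContent); // text of the whole matching table
}
print_r($found);
```

Letting XPath do the filtering replaces the nested row and column loops with a single query, and each result is a regular DOM node that can be walked further if the cells need to be processed individually.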
