PHP Regex: match anything between two specific words/tags, with a condition

I'm not good at regex. Here is my scenario:

I'm trying to extract some info from a webpage that contains several tables. Only some of the tables contain a unique URL (let's say "very/unique.key"), so the page looks like this:

<table ....>
   (bunch of content)
</table>

<table ....>
   (bunch of content)
</table>

<table ....>
   (bunch of content + "very/unique.key" keyword)
</table>

<table ....>
   (bunch of content)
</table>

<table ....>
   (bunch of content + "very/unique.key" keyword)
</table>

What I want is to extract the content of every table that contains the "very/unique.key" keyword. Here are the patterns that I have tried:

$pattern = "#<table[^>]+>((?!\<table)(?=very\/unique\.key).*)<\/table>#i";

This returns nothing to me....

$pattern = "#<table[^>]+>((?!<table).*)<\/table>#i";

This returns everything from table 1's opening tag <table...> up to the last table's closing tag </table>, even with the (?!<table) condition...

I'd appreciate any help with this, thanks.

--EDIT--

Here is the solution that I found using DOM to loop through every table

--My Solution--

    $index = array(); // indexes of all the tables that contain the keyword

    $DOM = new DOMDocument();
    $DOM->loadHTMLFile("http://uni.corp/sub/sub/target.php?key=123");
    $xpath = new DOMXPath($DOM);
    $tables = $DOM->getElementsByTagName("table");
    for ($n = 0; $n < $tables->length; $n++) {
        $rows = $tables->item($n)->getElementsByTagName("tr");
        for ($i = 0; $i < $rows->length; $i++) {
            $cols = $rows->item($i)->getElementsByTagName("td");
            for ($j = 0; $j < $cols->length; $j++) {
                $td = $cols->item($j); // grab the td element
                $img = $xpath->query('./img', $td)->item(0); // first direct img child, or null

                if ($img !== null) {
                    $image = $img->getAttribute('src'); // grab the source of the image
                    if ($image == "very/unique.key") {
                        // remember each matching table once, even when it
                        // is table 0 or holds several matching images
                        if (!in_array($n, $index, true)) {
                            $index[] = $n;
                        }
                    }
                }
            }
        }
    }

    // loop that echoes only the tables that contain the keyword
    foreach ($index as $temp) {
        $rows = $tables->item($temp)->getElementsByTagName("tr");
        for ($i = 0; $i < $rows->length; $i++) {
            $cols = $rows->item($i)->getElementsByTagName("td");
            for ($j = 0; $j < $cols->length; $j++) {
                echo $cols->item($j)->nodeValue, "\t";
                // process the extracted table content here
            }
            //echo "<br/>";
        }
    }

Personally, though, I'm still curious about the regex part, and I hope someone can find a regex pattern that solves this question. Anyway, thanks to everyone who helped or advised me on this (especially AbsoluteƵERØ).

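dsrruefh12970 The page I'm trying to parse/extract from is a dynamic page that uses AJAX/PHP/JS to generate its content, so most of the HTML elements have no unique identifier such as an id or class. Because the content is dynamic, I feel that parsing it with DOM could be much harder, even though my regex is poor as well. It is an intranet page that I only need to parse a certain amount of information from, for no longer than a month (I guess). Thanks for replying, and I'd appreciate anyone who can enlighten me with DOM, regex, or anything else.
about 7 years ago
dtkvlj5386 This looks like an XY problem. Your real question is how to get the table elements that contain the mentioned string. Using regex is just one attempted solution, and one that doesn't really add up.
about 7 years ago
dongrui6787 Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested, and debugged.
about 7 years ago
doupa1883 Why use regex in the first place? How about: php.net/manual/en/class.domelement.php
about 7 years ago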

2 answers

Though I agree with the comments on your post, I will give the solution. If you wanted to replace very/unique.key with something else, a working regex would look something like this:

#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU

The key here is to use the correct modifiers to make it work with your input string. For more information on these modifiers, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Now here's an example where I replace very/unique.key with "foobar":

<?php
$string = "
<table ....>
   (bunch of content)
</table>

<table ....>
   (bunch of content)
</table>

<table ....>
   bunch of content very/unique.key 
</table>

<table ....>
   (bunch of content)
</table>

<table ....>
   blabla very/unique.key
</table>
";

$pattern = '#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU';

echo preg_replace($pattern, '<table$1>$3foobar$4</table>', $string);
?>

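This code prints exactly the same string but with the two occurrences of "very/unique.key" replaced by "foobar", just like we want. Note that each match starts at the first <table> tag the engine reaches, so it also swallows any key-less tables that come before the one holding the key. That is harmless here because the replacement writes everything back, but it matters if you use the pattern to extract a single table.

Though this solution could work, it's certainly neither the most efficient nor the easiest to work with. Like Mehdi said in the comments, PHP has an extension specifically made to operate on XML (and thus HTML).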
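For extraction rather than replacement, each match has to stay inside a single table. Here is a minimal sketch of one way to do that, using a "tempered" dot that refuses to step over a table tag. The sample input and its id attributes are made up for illustration, and the pattern assumes tables are not nested (with nesting it would only ever match the innermost tables). It also shows why the two patterns in the question misbehave: a lookahead such as (?!<table) placed once after the opening tag is only tested at that single position, and (?=very\/unique\.key) in the same spot demands that the key start immediately after the ">".

```php
<?php
// Hypothetical sample input mirroring the structure in the question.
$html = <<<HTML
<table id="a">
   plain content
</table>
<table id="b">
   content <img src="very/unique.key"> more
</table>
<table id="c">
   plain content
</table>
<table id="d">
   other very/unique.key text
</table>
HTML;

// (?:(?!</?table\b).)* consumes one character at a time, but only while
// the upcoming text does not start a <table or </table tag, so a match
// can never run across a table boundary.
$pattern = '#<table\b[^>]*>((?:(?!</?table\b).)*?very/unique\.key(?:(?!</?table\b).)*)</table>#is';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the whole tables, $matches[1] only their inner content.
foreach ($matches[0] as $table) {
    echo $table, "\n---\n";
}
```

The lookahead is re-checked before every character, which is what the single (?!<table) in the question was missing. It is also slow on large pages, so treat it as a curiosity rather than a replacement for a real parser.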

Here's a link to the documentation of that extension http://www.php.net/manual/en/intro.dom.php

Using that, you could easily go through each table element and find the ones that contain the unique key.
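As a rough sketch of what that could look like, the snippet below lets an XPath query pick the tables, assuming (as in the asker's own solution) that the key appears as the src of an img element. The sample markup and its id attributes are invented for illustration.

```php
<?php
// Hypothetical sample; on the real page you would load the fetched HTML instead.
$html = '<html><body>
<table id="t1"><tr><td>no image here</td></tr></table>
<table id="t2"><tr><td><img src="very/unique.key">first hit</td></tr></table>
<table id="t3"><tr><td><img src="other.png">different image</td></tr></table>
<table id="t4"><tr><td><img src="very/unique.key">second hit</td></tr></table>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select every table that has an img descendant whose src equals the key.
$nodes = $xpath->query('//table[.//img[@src="very/unique.key"]]');

$found = array();
foreach ($nodes as $table) {
    // saveHTML($node) serialises the whole element, markup included.
    $found[] = $dom->saveHTML($table);
}

print_r($found);
```

Passing a node to saveHTML() returns the table's markup rather than just its text, which is useful when the rows and cells still need to be processed afterwards.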

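douye5949 I edited my answer to give more details and a more complete regex :)
about 7 years ago
douba3975 Thanks for your reply, but unfortunately it returns nothing for me :( Anyway, thanks again for being willing to help.
about 7 years ago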

This works in PHP 5. We parse the tables and then use preg_match() to check for the key. The reason to use a method like this is that HTML, unlike XML, does not have to be syntactically correct, so you may not actually have proper closing tags. You may also have nested tables, which would give you multiple results when trying to match opening and closing tags with regex. This way we only check for the key itself and not for the good form of the document being parsed.

<?php

$input = "<html>
<table id='1'>
<tr>
<td>This does not contain the key.</td>
</tr>
</table>
<table id='2'>
<tr>
<td>This does contain the unique.key!</td>
</tr>
</table>

<table id='3'>
<tr>
<td>This also contains the unique.key.</td>
</tr>
</table>

</html>";

$html = new DOMDocument;
$html->loadHTML($input);

$findings = array();

$tables = $html->getElementsByTagName('table');
foreach($tables as $table){

    $element = $table->nodeValue;

    if(preg_match('!unique\.key!',$element)){
        $findings[] = $element;
    }
}

print_r($findings);

?>

Output

Array
(
    [0] => This does contain the unique.key!
    [1] => This also contains the unique.key.
)
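One caveat with nested tables: the nodeValue of an outer table includes the text of every table inside it, so the loop above would report both the outer and the inner table. The following is a minimal sketch, assuming you only want the innermost table that holds the key. The sample markup and ids are invented for illustration.

```php
<?php
// Hypothetical input: an outer layout table wrapping two inner tables.
$input = "<html><body>
<table id='outer'><tr><td>
  <table id='inner1'><tr><td>This contains the unique.key</td></tr></table>
  <table id='inner2'><tr><td>Nothing here</td></tr></table>
</td></tr></table>
</body></html>";

libxml_use_internal_errors(true); // keep parser warnings out of the output
$html = new DOMDocument;
$html->loadHTML($input);
libxml_clear_errors();

$xpath = new DOMXPath($html);
// Tables whose text contains the key, minus those that merely wrap
// another table containing it.
$query = '//table[contains(., "unique.key")]'
       . '[not(.//table[contains(., "unique.key")])]';

$ids = array();
foreach ($xpath->query($query) as $table) {
    $ids[] = $table->getAttribute('id');
}

print_r($ids);
```

The second predicate is what drops the outer table. Without it, the query behaves like the preg_match() loop and returns both 'outer' and 'inner1'.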
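dongsui8162 As I mentioned above, it's a URL like "sub/filename.ext". I don't know why the regex doesn't work and returns nothing, but thanks to your pointing me to DOM, I found a solution that gets the whole table content. Check my answer, thanks.
about 7 years ago
douchengfei3985 What is the key? Are you escaping it correctly?
about 7 years ago
doulu2011 I made some idiotic mistakes earlier. The DOM part now works fine, and I only need to figure out the regex part, which doesn't work. Ah, regex, I hate regex... Anyway, thanks again.
about 7 years ago
doufei1893 I'm new to DOM as well, but it seems fine to work with. I have 2 questions though: 1) I need the entire content of the tables that have the unique key, not just the key; 2) can I pass something like "uni.corp/sub/sub/target.php?key=123" as the input to loadHTML()? It returns an empty array to me, even when I try to loop through every element without any condition. Anyway, thanks for pointing me in a new direction, it's very helpful. Thanks again.
about 7 years ago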