dtn55928 2014-12-09 22:50
Accepted

PHP regex to retrieve the text between HTML tags, but not the tags

Similar questions have probably been asked many times, but mine is a bit more complex.
I know that when we want to extract only the text inside the <title> tag in a document like this,

<title>My work</title>
<p>This is my work.</p> <p>Learning regex.</p>

we can form a Regex like this:

>([^<]*)<

Source

But that works only because the <title> tag comes first. If the tag I want were the second one, it wouldn't work.
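To make the limitation concrete, here is a minimal sketch that runs the pattern above against the sample markup. The variable names are only illustrative. It shows that `preg_match` returns the text of whichever tag comes first, while `preg_match_all` returns the text after every `>`, including the whitespace between tags.

```php
<?php
// Sample markup from the question, written as a single PHP string.
$html = "<title>My work</title>\n<p>This is my work.</p> <p>Learning regex.</p>";

// preg_match stops at the first match, which is the <title> text here
// only because <title> happens to be the first tag.
preg_match('#>([^<]*)<#', $html, $first);
$firstText = $first[1];
echo $firstText, PHP_EOL;

// preg_match_all collects every run of text that sits between a '>' and
// the next '<', so the gaps between tags are captured as well.
preg_match_all('#>([^<]*)<#', $html, $all);
$allTexts = $all[1];
print_r($allTexts);
```

The pattern has no notion of which tag it is inside, so it cannot be pointed at one particular element.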
Okay, my scenario is,

<td class="td1" headers="searchth1">JAVA1</td>
<td class="td2" headers="searchth2">JAVA2</td>
<td class="td3" headers="searchth3">JAVA3</td>

<td class="td1" headers="searchth1">PHP1</td>
<td class="td2" headers="searchth2">PHP2</td>
<td class="td3" headers="searchth3">PHP3</td>

There are many similar tags in the file, and I want to retrieve only the text between the <td class="td1" headers="searchth1"> and </td> tags.
I've used '#<td class="td1" headers="searchth1">(.*)</td>#', which works, but the output also includes all the other <td> tags, which I don't want.
I want only the texts JAVA1 and PHP1, and I guess that if I could retrieve the text between the tags while excluding the tags themselves, I might achieve it.
Am I right or wrong? Either way, how can I achieve what I want?
Thanks in advance!


  • douh9817 2014-12-09 23:07

    I think your regex approach, while technically possible, is going to cause more trouble down the line. For example, if the source HTML changed so that the headers attribute appeared before the class attribute, the regex would fail. Also, code that uses regex to search through HTML source becomes unreadable very quickly.
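    For illustration only, here is a minimal sketch of the regex route and of the fragility described above. It assumes the lazy quantifier `(.*?)` in place of the question's `(.*)`, so that each match stops at the first `</td>`. The `$swapped` string is a made-up variation of the markup with the two attributes in the other order.

```php
<?php
// Sample cells from the question, joined into one string.
$html = '<td class="td1" headers="searchth1">JAVA1</td>'
      . '<td class="td2" headers="searchth2">JAVA2</td>'
      . '<td class="td1" headers="searchth1">PHP1</td>';

// Hypothetical variation: same cell, attributes in the opposite order.
$swapped = '<td headers="searchth1" class="td1">JAVA1</td>';

// (.*?) is lazy, so each match ends at the nearest </td>;
// the s modifier lets the dot cross line breaks inside a cell.
$pattern = '#<td class="td1" headers="searchth1">(.*?)</td>#s';

preg_match_all($pattern, $html, $matches);
$texts = $matches[1];
print_r($texts); // contains JAVA1 and PHP1

// The same pattern finds nothing once the attribute order changes.
$swappedCount = preg_match_all($pattern, $swapped, $ignored);
var_dump($swappedCount); // int(0)
```

    The pattern works on the exact markup shown, but any change in attribute order, spacing, or quoting silently produces zero matches.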

    To parse HTML you should use PHP's DOMDocument functions, which are more robust in the face of changing HTML code and are far more readable to whoever may be maintaining your code (including you). This method will also support looking at other element attributes more easily. The sample code below should work for your use case:

    $doc = '<td class="td1" headers="searchth1">JAVA1</td>
    <td class="td2" headers="searchth2">JAVA2</td>
    <td class="td3" headers="searchth3">JAVA3</td>
    <td class="td1" headers="searchth1">PHP1</td>
    <td class="td2" headers="searchth2">PHP2</td>
    <td class="td3" headers="searchth3">PHP3</td>';
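    // Parse the HTML string into a DOM tree
    $dom = new DOMDocument();
    $dom->loadHTML($doc);
    $xpath = new DOMXPath($dom);
    // Select every <td> whose class attribute is exactly "td1"
    $tds = $xpath->query("//td[@class='td1']");
    // the query could also be "//td[@headers='searchth1']" or even
    // "//td[@headers='searchth1'][@class='td1']" depending on what you want to target
    foreach ($tds as $td) {
        // nodeValue holds the element's text without the surrounding tags
        var_dump($td->nodeValue);
    }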
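    As a possible variation on the code above, the following sketch wraps the lookup in a function that returns the matching texts as an array. The function name `tdTexts` and the `<table><tr>` wrapper around the sample cells are assumptions made for this example. It also assumes the attribute values contain no double quotes, since they are placed directly into the XPath string.

```php
<?php
// Hypothetical helper, not part of the original answer: returns the text of
// every <td> whose class and headers attributes match the given values.
function tdTexts(string $html, string $class, string $headers): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Attribute tests in XPath do not depend on attribute order in the source.
    $query = sprintf('//td[@class="%s"][@headers="%s"]', $class, $headers);

    $texts = [];
    foreach ($xpath->query($query) as $td) {
        $texts[] = trim($td->textContent);
    }
    return $texts;
}

// Sample cells from the question; the second td1 cell has its attributes
// swapped to show that the query still finds it.
$html = '<table><tr>'
      . '<td class="td1" headers="searchth1">JAVA1</td>'
      . '<td class="td2" headers="searchth2">JAVA2</td>'
      . '</tr><tr>'
      . '<td headers="searchth1" class="td1">PHP1</td>'
      . '<td class="td2" headers="searchth2">PHP2</td>'
      . '</tr></table>';

$result = tdTexts($html, 'td1', 'searchth1');
print_r($result); // contains JAVA1 and PHP1
```

    Returning an array rather than calling var_dump inside the loop keeps the extraction separate from whatever is done with the values afterwards.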
    

    If you want to learn more about building and using xpath queries, I suggest the article PHP DOM: Using XPath over at SitePoint.com.

    This answer was selected by the asker as the best answer.