drkjzk3359 2012-04-03 19:14
Views: 285
Accepted

How can I scrape HTML with a regular expression while ignoring whitespace and newlines in the markup?

I'm putting together a quick script to scrape a page for some results, and I'm having trouble figuring out how to ignore whitespace and newlines in my regex.

For example, here's how the page may present a result in HTML:

<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>

How would I change the following regex so that it ignores the spaces and newlines between the tags?

$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';

Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!


3 answers

  • dongpengqin3898 2012-04-03 19:19

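    You probably don't need the warning, but parsing HTML with regex is playing with fire. That said, to answer your question, you can use this regex:

    $regex='#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

    Each \s* matches zero or more whitespace characters (spaces, tabs, and newlines), so the indentation and line breaks between the tags no longer matter. The s modifier lets . match newlines in case the paragraph text spans several lines, and the lazy (.*?) stops at the first </p> instead of running on to the last one on the page. The pattern is not anchored with ^, so it can match anywhere in the page, and using # as the delimiter means the slashes in the closing tags need no escaping.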
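
A minimal sketch of how such a pattern might be used with preg_match. The sample $html string and the $text variable are made up for illustration, and the pattern here is unanchored with a lazy (.*?) capture; treat it as a starting point under those assumptions rather than a robust scraper.

```php
<?php
// Hypothetical sample input; in a real script $html would hold the fetched page.
$html = <<<EOF
<table><tr>
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
</tr></table>
EOF;

// \s* matches zero or more whitespace characters (spaces, tabs, newlines) between tags.
// (.*?) is lazy, so it stops at the first </p>; the s modifier lets . match newlines,
// and i makes the match case-insensitive. The # delimiter avoids escaping "/".
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

$text = null;
if (preg_match($regex, $html, $matches)) {
    // $matches[0] is the whole match; $matches[1] is the first capture group.
    $text = $matches[1];
    echo $text, "\n";
}
```

If the page contains several such cells, preg_match_all with the same pattern collects every captured paragraph in $matches[1].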
    

    Update: here is DOM parser-based code that gets what you want:

    $html = <<< EOF
    <td class="things">
        <div class="stuff">
            <p>I need to capture this text.</p>
        </div>
    </td>
    EOF;
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect markup warnings instead of printing them
    $doc->loadHTML($html); // parse the HTML string into a DOM tree
    $xpath = new DOMXPath($doc);
    // select every <p> inside div.stuff inside td.things
    $nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
    foreach ($nodelist as $node) {
        echo $node->nodeValue, "\n"; // prints: I need to capture this text.
    }
    

    And now please refrain from parsing HTML using regex in your code.

    This answer was selected as the best answer by the asker.