dsaf415212 2015-01-14 16:54


I want to parse various web pages so that I can build an inverted index. I only want to read the text, not the <a> elements, menus, and so on. Is this possible? Here is what I have so far:

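 <?php
 $ch = curl_init("http://en.wikipedia.org/wiki/Agile_software_development");
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow the redirect to HTTPS
 $c1 = curl_exec($ch);
 if ($c1 === false) {
     die(curl_error($ch));
 }

 $dom = new DOMDocument();
 @$dom->loadHTML($c1);  // @ suppresses warnings about malformed HTML

 $bodies = $dom->getElementsByTagName("body");
 echo "<br>";

 foreach ($bodies as $body) {
     // Print the text of every <a> element inside <body>.
     foreach ($body->getElementsByTagName("a") as $link) {
         echo $link->nodeValue;
         echo "<br>";
     }
 }
 ?>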

2 answers

  • douji6896 2015-01-16 12:15

    I would do it like this:

    <?php
    $html = <<<HTML
    <html>
      <head>
        <title>TITLE</title>
      </head>
      <body>
        <p>PARA 1</p>
        <p>PARA <span>2</span></p>
      </body>
    </html>
    HTML;
    
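    $dom = new DOMDocument();
    // @ suppresses libxml warnings about malformed markup.
    @$dom->loadHTML($html);
    
    // textContent of <body>: all descendant text, with the tags stripped.
    var_dump($dom->getElementsByTagName("body")[0]->textContent);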
    ?>
    

    The textContent property returns the text of the node and of all its descendants, concatenated in document order. The output of the above is:

    string(25) "
        PARA 1
        PARA 2
      "
    

    If you want to normalize the whitespace (replace every run of two or more whitespace characters, newlines included, with a single space, and strip the leading and trailing whitespace), you can do this:

    var_dump(preg_replace('/\s{2,}/', ' ', trim(
                    $dom->getElementsByTagName("body")[0]->textContent)));
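
    Note that textContent on <body> still includes the text of links, menus and scripts. A minimal sketch of one way to leave those out, assuming the unwanted content sits in script, style, nav and a elements: remove them from the tree first, then read textContent. The extractText function and its $skipTags list are hypothetical names made up for this example, and the tag list will probably need tuning per site.

```php
<?php
// Sketch: drop elements that are not body text, then read textContent.
// extractText() and the default tag list are assumptions for illustration.
function extractText(string $html, array $skipTags = ["script", "style", "nav", "a"]): string
{
    if (trim($html) === "") {
        return ""; // loadHTML() rejects an empty string on PHP 8
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings about malformed HTML

    $body = $dom->getElementsByTagName("body")->item(0);
    if ($body === null) {
        return "";
    }

    foreach ($skipTags as $tag) {
        // The node list is live, so copy it to an array before removing nodes.
        foreach (iterator_to_array($body->getElementsByTagName($tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Collapse every run of whitespace into a single space.
    return trim(preg_replace('/\s+/', ' ', $body->textContent));
}

$html = '<html><body><nav><a href="/">Home</a></nav>'
      . '<p>PARA 1 <a href="/x">link</a></p><p>PARA <span>2</span></p>'
      . '<script>var x = 1;</script></body></html>';

echo extractText($html), "\n";
```

    For the sample string this should print PARA 1 PARA 2. Removing every a element also drops link text that appears inside ordinary sentences, so for an index you may prefer to skip only nav and similar containers.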
    
    This answer was accepted by the asker as the best answer.