douchengchen7959 2012-03-18 12:57

Extracting text from a web page with PHP DOMDocument

I have the following script, which works almost fine except for two things:

  • I still get unknown tags in the output, such as <note>, <to>, or <?xml version="1.0" encoding="ISO-8859-1"?>
  • I also get the JavaScript code. I tried to exclude it with //text()[not(self::script)], but this breaks the XPath

Script:

$contents = file_get_contents("http://www.w3schools.com/php/php_xml_dom.asp");
$dom = new DOMDocument();
@$dom->loadHTML($contents);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXPath($dom);
// see http://www.w3schools.com/xpath/xpath_syntax.asp
$hrefs = $xpath->evaluate("//text()");
for ($i = 0; $i < $hrefs->length; $i++)
  echo $hrefs->item($i)->nodeValue;

Do you have a better solution for extracting text from a web page?

Note: I could simply use strip_tags, but I want to stick with DOMDocument.
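
For reference, here is a minimal sketch of one way to stay with DOMDocument and still drop the script content. It rests on a few assumptions that are not in the original post: the helper name extractVisibleText and the sample markup are hypothetical, and <style> is excluded alongside <script> as a guess at what is wanted. The idea is that a text node is never itself a script element, so a predicate on self::script cannot filter anything; testing the parent element instead should. Escaped markup in the page source (for example &lt;note&gt;) is ordinary text as far as the DOM is concerned, so it will most likely still appear in the output as a literal <note>.

```php
<?php
// Hypothetical helper (not part of the original script): returns the text
// of an HTML string, skipping the contents of <script> and <style>.
function extractVisibleText(string $html): string
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Keep text nodes whose parent is neither <script> nor <style>,
    // and drop nodes that contain only whitespace.
    $nodes = $xpath->query(
        '//text()[not(parent::script or parent::style)][normalize-space()]'
    );

    $parts = [];
    foreach ($nodes as $node) {
        $parts[] = trim($node->nodeValue);
    }
    return implode("\n", $parts);
}

// Sample markup, assumed for illustration only.
$sample = '<html><head><title>Demo</title><script>var hidden = 1;</script></head>'
        . '<body><p>Hello</p><pre>&lt;note&gt;text&lt;/note&gt;</pre></body></html>';
echo extractVisibleText($sample), "\n";
```

To use it on a live page, the result of file_get_contents() can be passed in place of $sample.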

1 answer

  • doupao3662 2012-03-18 13:01

    I've always used Simple HTML DOM (http://simplehtmldom.sourceforge.net/) for this, and it has worked every time.

    This answer was selected by the asker as the best answer.
