Simple HTML DOM crawler returns more content than the selected elements contain

I would like to extract the contents of certain parts of a website using selectors, and I am using Simple HTML DOM to do this. However, more data is returned than the selectors I specify should match. I have checked the Simple HTML DOM FAQ but did not see anything that could help, and I wasn't able to find anything on Stack Overflow either.

I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/

In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.

How can I limit the output to only include the data contained within the h2 tag?

Here is the code I am using:

<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.theatlantic.com/most-popular/";

$html = new simple_html_dom();

$html->load_file($target_url);

$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
  $post = $posts[$i];
  $post->find('h2[class=hed]',0)->outertext = "";
  echo strip_tags($post, '<p><a>');
}
?>
</div>

Output can be seen here. Instead of only a handful of article links, I also get author information, article summaries, and other text.

douzi1350 2016-02-20 14:17
You are not outputting the h2 contents, but the ul contents in the echo:

echo strip_tags($post, '<p><a>');

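Note that the statement before the echo does not restrict the output to the h2. In Simple HTML DOM, assigning an empty string to outertext removes that element from the rendered output, so this line blanks the first h2 while the rest of the ul is still printed: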

$post->find('h2[class=hed]',0)->outertext = "";

Change the code to this:

$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');

However, that only handles the first h2 found, so you need another loop. Here is a rewrite of the code after load_file:

$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
    if ($postNum >= 10) break; // limit reached
    $heds = $post->find('h2[class=hed]');
    foreach($heds as $hed) {
        echo strip_tags($hed, '<p><a>');
    }
}

If you still need to clear outertext, you can do it with $hed:

$hed->outertext = "";
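
For comparison, the same idea (select the h2 elements first, then print only those) can be sketched with PHP's built-in DOM extension rather than Simple HTML DOM. This is only an illustration under assumptions: the function name extractHeadlines and the sample markup are hypothetical, and the markup merely mimics the structure described in the question (h2 class="hed" and p class="dek has-dek" inside ul class="river"); the real page may differ.

```php
<?php
// Sketch using DOMDocument/DOMXPath (built into PHP), not Simple HTML DOM.
// extractHeadlines() and $sample are hypothetical names for illustration.

function extractHeadlines(string $html, int $limit = 10): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // h2 elements with class "hed" anywhere inside a ul with class "river".
    $query = '//ul[contains(concat(" ", normalize-space(@class), " "), " river ")]'
           . '//h2[contains(concat(" ", normalize-space(@class), " "), " hed ")]';

    $results = [];
    foreach ($xpath->query($query) as $h2) {
        if (count($results) >= $limit) {
            break; // limit reached
        }
        // First link inside this h2, if any.
        $link = $xpath->query('.//a', $h2)->item(0);
        $results[] = [
            'title' => trim($h2->textContent),
            'href'  => $link instanceof DOMElement ? $link->getAttribute('href') : null,
        ];
    }
    return $results;
}

// Hypothetical markup that imitates the structure described in the question.
$sample = <<<HTML
<ul class="river">
  <li><h2 class="hed"><a href="/a/1">First headline</a></h2><p class="dek has-dek">Summary one</p></li>
  <li><h2 class="hed"><a href="/a/2">Second headline</a></h2><p class="dek has-dek">Summary two</p></li>
</ul>
HTML;

foreach (extractHeadlines($sample) as $item) {
    echo $item['title'], ' => ', $item['href'], PHP_EOL;
}
```

Here the limit counts headlines rather than ul elements, which is probably closer to what the question intends, since the page likely has only one ul class="river". The p class="dek has-dek" text never appears in the result because only the matched h2 nodes are read.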
This answer was selected as the best answer by the asker.
