dousi6087 2016-02-20 13:52

Simple HTML DOM crawler returns more content than the selected elements contain

I would like to extract the contents of certain parts of a website using selectors, and I am using Simple HTML DOM to do this. However, more data is returned than is present in the elements I select. I have checked the Simple HTML DOM FAQ but did not see anything that could help, and I was not able to find anything on Stack Overflow either.

I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/

In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.

How can I limit the output to only include the data contained within the h2 tag?

Here is the code I am using:

<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.theatlantic.com/most-popular/";

$html = new simple_html_dom();

$html->load_file($target_url);

$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
  $post = $posts[$i];
  $post->find('h2[class=hed]',0)->outertext = "";
  echo strip_tags($post, '<p><a>');
}
?>
</div>

Output can be seen here. Instead of only a handful of article links, I get author information and article summaries, among other things.


2 answers

  • douzi1350 2016-02-20 14:17

    You are not outputting the h2 contents, but the ul contents in the echo:

    echo strip_tags($post, '<p><a>');
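To illustrate why the extra text appears, here is a minimal sketch that uses only PHP's built-in strip_tags(). The sample markup (including the byline span) is hypothetical and only loosely modelled on the structure described in the question; it is not the real page.

```php
<?php
// Hypothetical sample markup standing in for one item of the ul.
$ul = '<ul class="river"><li>'
    . '<h2 class="hed"><a href="/a/1">First headline</a></h2>'
    . '<p class="dek has-dek">Summary text</p>'
    . '<span class="byline">Author Name</span>'
    . '</li></ul>';

// strip_tags() drops the tags that are not in the allow-list but keeps
// the text inside them, so the summary and byline text survive.
$stripped = strip_tags($ul, '<p><a>');
echo $stripped, PHP_EOL;
```

The printed string no longer contains the ul, li, h2 or span tags, but the text inside them is still there, which matches the symptom in the question.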
    

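    Note that the statement before the echo does not narrow $post down to the h2. Assigning an empty string to outertext is how Simple HTML DOM removes an element, so this line drops the first h2 from the output rather than selecting it: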

    $post->find('h2[class=hed]',0)->outertext = "";
    

    Change the code to this:

    $hed = $post->find('h2[class=hed]',0);
    echo strip_tags($hed, '<p><a>');
    

    However, that only handles the first h2 found, so you need an inner loop over all matches. Here is a rewrite of the code after load_file:

    $posts = $html->find('ul[class=river]');
    foreach($posts as $postNum => $post) {
        if ($postNum >= 10) break; // limit reached
        $heds = $post->find('h2[class=hed]');
        foreach($heds as $hed) {
            echo strip_tags($hed, '<p><a>');
        }
    }
    

    If you still need to clear outertext, you can do it with $hed:

    $hed->outertext = "";
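If only the link targets and headline text are wanted, the same idea can be sketched without Simple HTML DOM, using PHP's built-in DOMDocument and DOMXPath instead. This is a swapped-in technique, not the library from the question. The sample markup and the helper name extractHeadlines() are assumptions made for illustration.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$page = <<<HTML
<ul class="river">
  <li>
    <h2 class="hed"><a href="/a/1">First headline</a></h2>
    <p class="dek has-dek">Summary one</p>
  </li>
  <li>
    <h2 class="hed"><a href="/a/2">Second headline</a></h2>
    <p class="dek has-dek">Summary two</p>
  </li>
</ul>
HTML;

/**
 * Return href/text pairs for links inside h2.hed within ul.river.
 * extractHeadlines() is a hypothetical helper, not part of any library.
 */
function extractHeadlines(string $html, int $limit = 10): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world HTML is rarely well-formed.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Exact match on the class attribute, mirroring [class=river] and [class=hed].
    $links = $xpath->query('//ul[@class="river"]//h2[@class="hed"]//a');

    $result = [];
    foreach ($links as $link) {
        if (count($result) >= $limit) {
            break; // limit reached
        }
        $result[] = [
            'href' => $link->getAttribute('href'),
            'text' => trim($link->textContent),
        ];
    }
    return $result;
}

foreach (extractHeadlines($page) as $item) {
    echo $item['href'], ' ', $item['text'], PHP_EOL;
}
```

Unlike the loop above, the limit here counts headlines rather than ul elements; whether that matches the intent of $limit in the question is a guess.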
    
    Accepted by the asker as the best answer.