ds342222 2018-05-22 21:19

Simple HTML DOM always loads the default first page instead of the specified URL

I want to scrape a few web pages. I am using PHP and the Simple HTML DOM parser. For instance, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5

I use this to load the URL:

$html = new simple_html_dom();
$html->load_file($url);

This loads the correct page. Then I find the next page link, which here will be: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6

Only the page value changes, from 5 to 6. The code snippet to get the next link is:

function getNextLink($_htmlTemp)
{
    // Get the first "next page" anchor; find() returns null when there is none
    $aNext = $_htmlTemp->find('a.next', 0);
    if ($aNext === null) {
        return null;
    }
    return $aNext->href;
}

The above function returns the correct link, with the page value being 6. But when I try to load this next link, it fetches the default first page, as if the page query parameter were absent from the URL.

// After the loop we have details of all the listings on this page, so get the next page link
$nxtLink = getNextLink($originalHtml);  // Returns the URL as a string
if (!empty($nxtLink))
{
    // Yay, we have the next link -- load it
    print 'Next Url: '.$nxtLink.'<br>'; // $nxtLink has the correct value
    $originalHtml->load_file($nxtLink); // This line fetches the default page
}
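One thing worth checking here is whether the printed link really is the string being requested. The print statement above writes into an HTML page, so a browser would display an encoded `&amp;` as a plain `&` and the link would look correct even if it is not. The sketch below assumes the raw href still contains `&amp;` (an assumption about this page, to be confirmed by viewing the page source); `$rawHref`, `$rawParams` and `$decodedParams` are illustrative names only.

```php
<?php
// Hypothetical raw href, assumed to still carry the HTML-encoded ampersand.
$rawHref = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&amp;page=6';

// Escape the value before printing into HTML so the browser shows it literally.
echo 'Next Url: ' . htmlspecialchars($rawHref) . "<br>\n";

// Split the query string into parameters, once for the raw href
// and once for the entity-decoded href.
parse_str((string) parse_url($rawHref, PHP_URL_QUERY), $rawParams);
parse_str((string) parse_url(html_entity_decode($rawHref), PHP_URL_QUERY), $decodedParams);

var_dump(array_keys($rawParams));      // parameter names sent with the raw href
var_dump(array_keys($decodedParams));  // parameter names after decoding
```

If the raw list has no parameter named exactly `page`, a server that looks for `page` would not find it, which would be consistent with the first page being served.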

The whole flow is something like this:

$html->load_file($url);

// Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
// Main array
$value = array();
do {
    $listings = $originalHtml->find('div.searchResult');
    foreach ($listings as $item)
    {
        // Some logic here
    }

    // After the loop we have details of all the listings on this page, so get the next page link
    $nxtLink = getNextLink($originalHtml);  // Returns the URL as a string
    if (!empty($nxtLink))
    {
        // Yay, we have the next link -- load it
        print 'Next Url: '.$nxtLink.'<br>';
        $originalHtml->load_file($nxtLink);
    }
    else
    {
        // No next link -- stop the loop as we have covered all the pages
        $shouldLoop = false;
    }
} while ($shouldLoop);

I have tried encoding the whole URL and encoding only the query parameters, but got the same result. I also tried creating new instances of simple_html_dom and then loading the file, with no luck. Please help.

1 answer

  • dream989898 2018-05-23 06:15

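    You need to run html_entity_decode() on those links. I can see that they are getting mangled by simple-html-dom: the href comes back with its HTML entities still encoded, so the & in the query string arrives as &amp;.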

    $url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
    $html = str_get_html(file_get_contents($url));

    // Stop when the page could not be parsed or has no "next" link
    while ($html && ($a = $html->find('a.next', 0))) {
      // Decode entities such as &amp; before using the href as a URL
      $url = html_entity_decode($a->href);
      echo $url . "\n";
      $html = str_get_html(file_get_contents($url));
    }
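The loop above could also be given a guard so that it stops if a "next" link ever leads back to a page already visited, which is what the undecoded links would tend to cause. This is only a sketch under stated assumptions: `collectPageUrls`, the `$getRawNextHref` callback and the `$fakeSite` data are hypothetical names introduced for illustration, and the example.com URLs stand in for the real site. In a real script the callback would wrap the `str_get_html(file_get_contents($url))` and `find('a.next', 0)` calls from the answer. The arrow function needs PHP 7.4 or later.

```php
<?php
/**
 * Follow "next" links from $startUrl and return the URLs visited, in order.
 * $getRawNextHref is a hypothetical callback: given a page URL, it returns
 * the raw href of that page's "next" link, or null when there is none.
 */
function collectPageUrls(string $startUrl, callable $getRawNextHref, int $maxPages = 50): array
{
    $visited = [];
    $url = $startUrl;
    // Stop on: no next link, a URL seen before, or the page limit.
    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;
        $rawHref = $getRawNextHref($url);
        // Decode entities such as &amp; before the href is used as a URL.
        $url = ($rawHref === null)
            ? null
            : html_entity_decode($rawHref, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    return array_keys($visited);
}

// Stub standing in for the real fetch-and-parse step (hypothetical data).
$fakeSite = [
    'https://example.com/list?channel=motorhomes'        => 'https://example.com/list?channel=motorhomes&amp;page=2',
    'https://example.com/list?channel=motorhomes&page=2' => 'https://example.com/list?channel=motorhomes&amp;page=3',
    'https://example.com/list?channel=motorhomes&page=3' => null,
];

$pages = collectPageUrls(
    'https://example.com/list?channel=motorhomes',
    fn (string $url) => $fakeSite[$url] ?? null
);
print_r($pages);
```

Passing the fetch step in as a callback keeps the loop logic testable without network access; the page limit is a second safety net in case a site generates an endless chain of distinct URLs.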
    
    This answer was selected as the best answer by the asker.