ds342222 2018-05-22 13:19
浏览 86
已采纳

简单的html dom总是加载默认的第一页而不是指定的url

I want to scrape few web pages. I am using php and simple html dom parser. For instance trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5

I use this load the url.

$html = new simple_html_dom();
$html->load_file($url);

This loads the correct page. Then I find the next page link, here it will be: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6

Just the page value is changed from 5 to 6. The code snippet to get the next link is:

function getNextLink($_htmlTemp)
{
    //Getting the next page links
    $aNext = $_htmlTemp->find('a.next', 0);
    $nextLink = $aNext->href;    
    return $nextLink;
}

The above method returns the correct link with page value being 6. Now when I try to load this next link, it fetches the first default page with page query absent from the url.

//After loop we will have details of all the listing in this page -- so get next page link
    $nxtLink = getNextLink($originalHtml);  //Returns string url
    if(!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link        
        print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
        $originalHtml->load_file($nxtLink); //This line fetches default page
    }

The whole flow is something like this:

 $html->load_file($url);


//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
//Main Array
$value = array();
do{
    $listings = $originalHtml->find('div.searchResult');    
    foreach($listings as $item)
    {
        //Some logic here
    }


    //After loop we will have details of all the listing in this page -- so get next page link
    $nxtLink = getNextLink($originalHtml);  //Returns string url
    if(!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link        
        print 'Next Url: '.$nxtLink.'<br>';
        $originalHtml->load_file($nxtLink);
    }
    else
    {
        //No next link -- stop the loop as we have covered all the pages
        $shouldLoop = false;
    }

} while($shouldLoop);

I have tried encoding the whole url, only the query parameters but the same result. I also tried creating new instances of simple_html_dom and then loading the file, no luck. Please help.

展开全部

  • 写回答

1条回答 默认 最新

  • dream989898 2018-05-22 22:15
    关注

    You need to html_entity_decode those links, I can see that they are getting mangled by simple-html-dom.

    $url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
    $html = str_get_html(file_get_contents($url));
    
    while($a = $html->find('a.next', 0)){
      $url = html_entity_decode($a->href);
      echo $url . "
    ";
      $html = str_get_html(file_get_contents($url));
    }
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
编辑
预览

报告相同问题?

悬赏问题

  • ¥15 PADS Logic 原理图
  • ¥15 PADS Logic 图标
  • ¥15 电脑和power bi环境都是英文如何将日期层次结构转换成英文
  • ¥20 气象站点数据求取中~
  • ¥15 如何获取APP内弹出的网址链接
  • ¥15 wifi 图标不见了 不知道怎么办 上不了网 变成小地球了