douyi6168 2014-04-05 13:11

Open a URL with cURL, click an AJAX button, then wait for and retrieve the response HTML

I am trying to scrape http://www.car4you.at/Haendlersuche. It shows 20 results on the first page, followed by pagination. I can scrape the first 20 links successfully, but I cannot get the link to the next page, because the pagination href contains no URL, only a JavaScript function call:

href="javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')"

My question is: how can I load the page with cURL, click the next-page button, wait for the response, and then parse it?

Here is what I have tried so far.

The cURL function:

function postCurlReq($loginActionUrl, $parameters, $referer)
{
    // realpath() returns false when the file does not exist yet, so build the path directly.
    $cookieFile = __DIR__ . '/cookie.txt'; // stored next to this script

    curl_setopt($this->curl, CURLOPT_URL, $loginActionUrl);
    curl_setopt($this->curl, CURLOPT_POST, true);
    curl_setopt($this->curl, CURLOPT_POSTFIELDS, $parameters);
    curl_setopt($this->curl, CURLOPT_COOKIEJAR, $cookieFile);  // cookies are written here
    curl_setopt($this->curl, CURLOPT_COOKIEFILE, $cookieFile); // cookies are read from here
    curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0');
    curl_setopt($this->curl, CURLOPT_REFERER, $referer);       // set referer
    curl_setopt($this->curl, CURLOPT_SSL_VERIFYPEER, false);   // skips certificate verification
    curl_setopt($this->curl, CURLOPT_SSL_VERIFYHOST, 2);

    $result = [];
    $result['EXE'] = curl_exec($this->curl);    // response body, or false on failure
    $result['INF'] = curl_getinfo($this->curl); // transfer info (HTTP code, timings, ...)
    $result['ERR'] = curl_error($this->curl);   // empty string when no error occurred
    return $result;
}

And this is the code I tried for the pagination:

$loginUrl = "http://www.car4you.at/Haendlersuche";
$parameters = array("href" => "javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')");
$referer = "http://www.car4you.at/Haendlersuche";

$loginHTML = $crawler->postCurlReq($loginUrl,$parameters,$referer);

if (empty($loginHTML['ERR'])) { // no error occurred while opening the URL
    print_r($loginHTML['EXE']);
}

A second approach would be the select list that controls how many results are shown (10, 20, 50). If my script could select 50, that would also work for me. This is the code I tried for the select list:

$loginUrl = "http://www.car4you.at/Haendlersuche";
$parameters = array("value" => "50");
$referer = "http://www.car4you.at/Haendlersuche";

$loginHTML = $crawler->postCurlReq($loginUrl,$parameters,$referer);

if (empty($loginHTML['ERR'])) { // no error occurred while opening the URL
    print_r($loginHTML['EXE']);
}

1 answer

  • duanqi5114 2014-04-05 13:31

    When you scrape a site you are not running a browser; you only receive the HTML response from the server. That means the JavaScript is never executed: you would have to parse it yourself, or use a library that does it for you.

    However, an AJAX button that fetches more results is just calling another URL (possibly with GET or POST variables) and then either parsing the result itself or inserting it somewhere into the page's HTML. You can work out which URLs are being called with the Developer Tools in Chrome, Firebug, etc. You can then scrape those URLs instead of the original one to extract the information.
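As a rough sketch of that idea in PHP: once the request has been captured in the browser's network panel, it can be replayed with cURL. The function names, the endpoint URL and the field names used here are placeholders (assumptions, not the site's real values); the real ones have to be copied from the captured request.

```php
<?php
// Encode the captured form fields the way a browser form POST would
// (application/x-www-form-urlencoded).
function encodeAjaxFields(array $fields): string
{
    return http_build_query($fields, '', '&');
}

// Replay a captured AJAX POST and return the raw response body.
// $ajaxUrl and $fields are whatever the developer tools showed for the
// real request; nothing here is specific to the site in the question.
function fetchAjaxFragment(string $ajaxUrl, array $fields, string $referer): string
{
    $ch = curl_init($ajaxUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => encodeAjaxFields($fields),
        CURLOPT_REFERER        => $referer,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        // Many AJAX endpoints expect this header; drop it if the captured
        // request did not send it.
        CURLOPT_HTTPHEADER     => ['X-Requested-With: XMLHttpRequest'],
    ]);

    $body  = curl_exec($ch);
    $error = curl_error($ch);
    if ($body === false) {
        throw new RuntimeException("AJAX request failed: $error");
    }
    return $body;
}
```

A call might then look like `fetchAjaxFragment($capturedUrl, ['pagerArgument' => '1_1874'], 'http://www.car4you.at/Haendlersuche')`, where `$capturedUrl` and `pagerArgument` stand in for the values seen in the real request. The response is often an HTML fragment or JSON rather than a full page, so the parsing step may need adjusting.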

    In this particular case it is quite tricky, because the AJAX request carries a number of POST variables and spotting the pattern isn't trivial. It is possible, though, and probably easier than trying to emulate the JavaScript.
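When looking for that pattern, it may help to pull the arguments out of each pager link and compare them with the POST variables shown in the developer tools. A minimal sketch, assuming every pager href has the same three-argument shape as the one quoted in the question; the function name and the array keys are labels invented for this example:

```php
<?php
// Extract the function name and its three quoted arguments from a pager
// href such as:
//   javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')
// Returns null when the href does not have that shape.
function parsePagerHref(string $href): ?array
{
    $pattern = "/^javascript:\s*(\w+)\(\s*'([^']*)'\s*,\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/";
    if (preg_match($pattern, $href, $m) !== 1) {
        return null;
    }
    return [
        'function' => $m[1], // e.g. AjaxCallback_ResList
        'target'   => $m[2], // e.g. ResultList
        'control'  => $m[3], // e.g. Pager
        'argument' => $m[4], // e.g. 1_1874
    ];
}
```

Whether the last argument (`1_1874` in the question) appears verbatim in one of the POST variables is something to verify against the captured request, not something this sketch can guarantee.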

    In general, if you really want to simulate running JavaScript while scraping, you can run a real browser and interact with it programmatically. This is what Selenium does, and I suspect this task could be done fairly painlessly with it. It's probably still easier to do it by sniffing the AJAX request, though.

    This answer was selected as the best answer by the asker.
