douyudouchao6779 2010-09-09 17:34

Scraping an ASP.NET website using POST variables in PHP

For the past few days I have been trying to scrape a website, so far without luck.

The situation is as follows: the page I am trying to scrape requires data from a previously submitted form. I have identified the variables the web app requires and have inspected the HTTP headers sent with the original request.

Since I have pretty much zero knowledge of ASP.NET, I thought I'd ask whether I am missing something here.

I have tried different methods (cURL, file_get_contents and the Snoopy class). Here is my code for the cURL method:

<?php
$url = 'http://www.urltowebsite.com/Default.aspx';
$fields = array('__VIEWSTATE' => 'averylongvar',
                '__EVENTVALIDATION' => 'anotherverylongvar',
                'A few' => 'other variables');

$fields_string = http_build_query($fields);

$curl = curl_init($url);

curl_setopt_array
(
    $curl,
    array
    (
        CURLOPT_RETURNTRANSFER  =>    true,
        CURLOPT_SSL_VERIFYPEER  =>    0,  //    Disables certificate checks;
        CURLOPT_SSL_VERIFYHOST  =>    0,  //        only relevant for https:// URLs.
        CURLOPT_HTTPHEADER      =>
            array
            (
                'Content-type: application/x-www-form-urlencoded; charset=utf-8',
                'Set-Cookie: ASP.NET_SessionId='.uniqid().'; path: /; HttpOnly'
            ),
        CURLOPT_POST            =>    true,
        CURLOPT_POSTFIELDS      =>    $fields_string,
        CURLOPT_FOLLOWLOCATION => 1
    )
);

$response = curl_exec($curl);
curl_close($curl);

echo $response;
?>

The browser sends and receives the following headers for the original request:

Request Headers

  • Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
  • Content-Type:application/x-www-form-urlencoded
  • User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5

Form Data

  • A lot of form fields

Response Headers

  • Cache-Control:private
  • Content-Length:30168
  • Content-Type:text/html; charset=utf-8
  • Date:Thu, 09 Sep 2010 17:22:29 GMT
  • Server:Microsoft-IIS/6.0
  • X-Aspnet-Version:2.0.50727
  • X-Powered-By:ASP.NET

When I inspect the headers sent by the cURL script I wrote, the request carries no form data, and the request method is not set to POST either. This seems to be where things go wrong, but I am not sure.
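One way to see what cURL actually sends is to enable CURLINFO_HEADER_OUT and read the request back with curl_getinfo(). The following is a minimal sketch, assuming the curl extension is loaded. The helper name build_debug_options is hypothetical, and the field values are the placeholders from the script above. One thing worth checking: with CURLOPT_FOLLOWLOCATION enabled, libcurl by default turns a POST into a GET when it follows a 301, 302 or 303 redirect, so if the server redirects (for example to a session-expired page), the request observed last may be a GET without a body. The sketch therefore disables redirect following while debugging.

```php
<?php
// Hypothetical helper: cURL options for a form POST with request logging enabled.
function build_debug_options(array $fields): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_FOLLOWLOCATION => false, // show the first response instead of following redirects
        CURLINFO_HEADER_OUT    => true,  // record the request line and headers cURL sends
    ];
}

// Placeholder values taken from the question, not real form data.
$options = build_debug_options([
    '__VIEWSTATE' => 'averylongvar',
    'A few'       => 'other variables',
]);

// Against a real site (the URL is the question's placeholder):
// $curl = curl_init('http://www.urltowebsite.com/Default.aspx');
// curl_setopt_array($curl, $options);
// curl_exec($curl);
// echo curl_getinfo($curl, CURLINFO_HEADER_OUT); // request as sent
// echo curl_getinfo($curl, CURLINFO_HTTP_CODE);  // a 3xx status would indicate a redirect
```

If the recorded request line starts with POST here but not with redirects enabled, the redirect is the likely reason the method appears to change.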

Any help is appreciated!!!

EDIT: I forgot to mention that the result of the scraping is the remote website's custom "session expired" page.

2 answers

  • duanchuang6978 2010-09-09 17:42

    Since __VIEWSTATE holds the state of the page at one particular moment (all of it encoded into one long, seemingly messy string), you cannot be sure that the value you scraped earlier is still valid for your "mock" request (I'm quite sure it is not ;) ).
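Because the value is tied to a specific page response, one option is to read the hidden fields from a freshly fetched copy of the page rather than hard-coding them. The following is a minimal sketch, assuming the DOM extension is available. The function name extract_hidden_fields and the sample markup are made up for illustration and do not come from the site in question.

```php
<?php
// Collect every named hidden <input> (this includes __VIEWSTATE and
// __EVENTVALIDATION) from a page's HTML.
function extract_hidden_fields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Made-up sample markup, for illustration only.
$sample = '<form method="post" action="Default.aspx">'
        . '<input type="hidden" name="__VIEWSTATE" value="abc123" />'
        . '<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />'
        . '<input type="text" name="query" value="" />'
        . '</form>';

$hidden = extract_hidden_fields($sample);
```

The returned array can be merged with the visible form fields before it is passed to http_build_query().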

    If you really have to deal with the __VIEWSTATE and __EVENTVALIDATION params, my advice is to take another approach: scrape the content via Selenium or an HtmlUnit-like library (unfortunately I don't know whether something similar exists for PHP).
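If a browser-driving tool is not available, a plain cURL flow that may be worth trying is to GET the form page first and then POST on the same handle, so that any session cookie the server sets is sent back automatically. This is a sketch under assumptions, not a confirmed fix for this site: the names session_curl_options and get_then_post are hypothetical, and the caller supplies a callback that builds the POST fields from the fetched HTML. Note that Set-Cookie is a response header; a client returns cookies in a Cookie header, which cURL's cookie engine handles here.

```php
<?php
// Options shared by both requests.
function session_curl_options(): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        // An empty string enables cURL's in-memory cookie engine without
        // reading a cookie file, so cookies persist across requests on this handle.
        CURLOPT_COOKIEFILE     => '',
    ];
}

// GET the page, let $buildFields turn its HTML into POST fields,
// then POST to the same URL on the same handle.
function get_then_post(string $url, callable $buildFields)
{
    $curl = curl_init($url);
    curl_setopt_array($curl, session_curl_options());

    $page = curl_exec($curl); // 1) GET: the server may set a session cookie here
    if ($page === false) {
        return false;
    }

    curl_setopt_array($curl, [
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($buildFields($page)),
    ]);

    return curl_exec($curl);  // 2) POST: stored cookies are sent back
}
```

The callback would typically return the page's current hidden fields merged with the form values to submit. Whether the server accepts the postback still depends on how it validates the request.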
