douxi3977 2017-06-25 12:31
78 views

Crawling HTML pagination managed by ASP.NET / AJAX (__doPostBack)

I've read a lot about scraping websites driven by JavaScript and ASP.NET, and I learned that you first have to send as much information as possible to convince the ASP.NET server that you really clicked the pagination link.

This is what I'm trying to reach: [image]

Or the Next button: [image]

I've tried my best, but only the first page gets crawled. I can never reach the second, third, and later pages.

Everything else works; my only problem is that I can't access the other pages.

At this point I'm wondering whether I'm doing something wrong in my Go code or whether I'll have to give up and accept that this site can't be scraped.

I'm using a client := &http.Client{} so that I can tweak the request headers:

    req, err := http.NewRequest("POST", urlToScrap, strings.NewReader(form.Encode()))
    if err != nil {
        panic(err)
    }
    req.Header.Set("X-MicrosoftAjax", "Delta=true")
    req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36")

    res, err := client.Do(req)
    if err != nil {
        panic(err)
    }

Here is the form data I'm trying to send in my POST request:

    form.Add("__EVENTTARGET", "")
    form.Add("_TSM_HiddenField_", "2GFwlGU9ATlFIxrdsXRzcja58_1t5F8HSleaZM4ZQwk1")
    form.Add("__EVENTVALIDATION", eventvalidation)
    form.Add("__VIEWSTATEGENERATOR", "20C6E8CA")
    form.Add("__VIEWSTATE", viewstat)

I copied and pasted the __VIEWSTATE and __EVENTVALIDATION values straight from the browser's network tab into variables (they're really huge!).
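Copied values can go stale, since ASP.NET issues fresh __VIEWSTATE and __EVENTVALIDATION values with each response. A minimal sketch of reading them from the fetched page instead is shown below. It assumes the hidden inputs are rendered with name before value, which is typical ASP.NET output but not verified for this site; hiddenFields is a hypothetical helper, and the sample HTML is made up for illustration. A field such as _TSM_HiddenField_ does not match this pattern and would need its own handling.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"regexp"
)

// hiddenInput matches <input ... name="__NAME" ... value="...">.
// Assumption: the name attribute comes before the value attribute.
var hiddenInput = regexp.MustCompile(`<input[^>]*\bname="(__[A-Z]+)"[^>]*\bvalue="([^"]*)"`)

// hiddenFields (hypothetical helper) collects the double-underscore hidden
// fields of a page into form values ready to be posted back.
func hiddenFields(page string) url.Values {
	form := url.Values{}
	for _, m := range hiddenInput.FindAllStringSubmatch(page, -1) {
		// Attribute values are HTML-escaped in the page source.
		form.Set(m[1], html.UnescapeString(m[2]))
	}
	return form
}

func main() {
	// Made-up sample markup standing in for the first page's HTML.
	page := `<form>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK+abc=" />
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="20C6E8CA" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEdAAX=" />
</form>`

	form := hiddenFields(page)
	fmt.Println(form.Get("__VIEWSTATEGENERATOR"))
}
```

The same extraction would then be repeated on every response, so that each request posts back the state the server sent most recently.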

__EVENTTARGET starts out blank because I set it inside a for loop around my crawler (I'm using GoQuery), which runs until I reach the last page (I know exactly how many pages I want to crawl):

    for page := 1; page < 139; page++ {

        urlPaginated := "ctl00$ContentPlaceHolder1$pager$rptPager$ctl" + strconv.Itoa(page) + "$lbtnClick"
        form.Set("__EVENTTARGET", urlPaginated)

The $ctl index is the only thing I saw change when clicking the buttons, so I assumed it was what determined the content loaded from the URL.
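ASP.NET usually zero-pads auto-generated control IDs to at least two digits (ctl00, ctl01, ...), whereas strconv.Itoa(page) produces ctl1, ctl2, and so on. The sketch below builds the padded form; pagerTarget is a hypothetical helper, and whether the repeater index really equals the page number on this site is an assumption that should be checked against the __doPostBack arguments in the pager links.

```go
package main

import "fmt"

// pagerTarget (hypothetical helper) builds the __EVENTTARGET value for a
// pager link, zero-padding the repeater item index to two digits.
func pagerTarget(index int) string {
	return fmt.Sprintf("ctl00$ContentPlaceHolder1$pager$rptPager$ctl%02d$lbtnClick", index)
}

func main() {
	for _, i := range []int{1, 9, 10, 138} {
		fmt.Println(pagerTarget(i))
	}
}
```

Indexes of 100 and above keep all their digits, since %02d only sets a minimum width.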

And then, I do my scraping:

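    // Parse the response body into a queryable document.
    doc, err := goquery.NewDocumentFromResponse(res)
    if err != nil {
        log.Fatal(err)
    }

    // Each result row carries the company ID in the name attribute of its link.
    doc.Find(".resultstable tbody tr").Each(func(i int, s *goquery.Selection) {
        companyID, ok := s.Find("td > a").Attr("name")
        if !ok {
            // Skip rows without a named link instead of scraping an empty ID.
            fmt.Println("row", i, "has no company ID")
            return
        }

        fmt.Println(companyID)
        scrapIt(companyID)
        time.Sleep(time.Second / 2) // throttle between detail requests
    })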

The only fields I haven't tried passing in the form are these:

[image]

So here I am, lost and clueless. If anyone has an idea, I would be grateful!

1 answer

  • donglvmang8638 2017-06-26 13:01

    I didn't find a way around it, but I found a simple LoadMore button on the mobile site that bypasses the main problem.

    It's a bit awkward to crawl the mobile version instead, but it works.