I've read a lot about scraping a website driven by JavaScript and ASP.NET, and I learned that you first have to send as much information as possible in order to trick the ASP.NET server into believing that you really clicked the pagination link.
This is what I'm trying to reach:
I've tried my best, but only the first page gets crawled. I can never reach the second, third, or any later page.
Everything else is going well; my only problem is that I can't access the other pages!
At this point I'm wondering whether I'm doing something wrong in my Go code, or whether I'll have to give up and tell myself "OK, that can't be scraped".
I'm using my own client (client := &http.Client{}) so that I can modify the request headers slightly:
req, err := http.NewRequest("POST", urlToScrap, strings.NewReader(form.Encode()))
if err != nil {
	panic(err)
}
req.Header.Set("X-MicrosoftAjax", "Delta=true")
req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36")
res, err := client.Do(req)
if err != nil {
	panic(err)
}
Here is the form data I'm trying to send in my POST request:
form.Add("__EVENTTARGET", "")
form.Add("_TSM_HiddenField_", "2GFwlGU9ATlFIxrdsXRzcja58_1t5F8HSleaZM4ZQwk1")
form.Add("__EVENTVALIDATION", eventvalidation)
form.Add("__VIEWSTATEGENERATOR", "20C6E8CA")
form.Add("__VIEWSTATE", viewstat)
I copied and pasted the __VIEWSTATE and __EVENTVALIDATION values directly from the browser's network tab into variables (they're really huge!).
__EVENTTARGET starts out blank because I set it inside a for loop around my crawler (I'm using goquery), which runs until I reach the last page (I know exactly how many pages I want to crawl):
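Since these hidden values usually change from one response to the next, pasted copies may well go stale after the first request. A rough sketch of reading them from the page instead, using only the standard library; the function name hiddenFields, the regular expression, and the sample HTML are my own assumptions, and the pattern only covers inputs rendered as type="hidden" name="..." ... value="..." with double quotes:

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// hiddenInput matches hidden inputs written as
// <input type="hidden" name="..." ... value="...">.
// Assumption: this attribute order and quoting match the real page.
var hiddenInput = regexp.MustCompile(`<input type="hidden" name="([^"]+)"[^>]*\bvalue="([^"]*)"`)

// hiddenFields returns a name -> value map of the hidden inputs in page.
func hiddenFields(page string) map[string]string {
	fields := make(map[string]string)
	for _, m := range hiddenInput.FindAllStringSubmatch(page, -1) {
		fields[m[1]] = html.UnescapeString(m[2])
	}
	return fields
}

func main() {
	// Placeholder HTML standing in for a real response body.
	page := `<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />`
	for name, value := range hiddenFields(page) {
		fmt.Println(name, value)
	}
}
```

With goquery already in use, selecting input[name=__VIEWSTATE] and reading its value attribute would be the more robust route; the regular expression is only here to keep the sketch dependency-free.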
for page := 1; page < 139; page++ {
	urlPaginated := "ctl00$ContentPlaceHolder1$pager$rptPager$ctl" + strconv.Itoa(page) + "$lbtnClick"
	form.Set("__EVENTTARGET", urlPaginated)
The $ctl part is the only thing I saw changing when clicking the pager buttons, so I assumed it is what determines the content loaded from the URL.
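One detail I'm unsure about here: ASP.NET normally names repeater items with a two-digit, zero-padded index (ctl00, ctl01, ...), whereas strconv.Itoa(page) produces ctl1, ctl2, and so on. A small sketch of the padded form, assuming this pager follows that convention and that the index matches what the browser sends (both need checking against the real request in the network tab); pagerTarget is a name I made up:

```go
package main

import "fmt"

// pagerTarget builds the __EVENTTARGET value for a pager button.
// Assumption: the control index is zero-padded to two digits.
func pagerTarget(index int) string {
	return fmt.Sprintf("ctl00$ContentPlaceHolder1$pager$rptPager$ctl%02d$lbtnClick", index)
}

func main() {
	for i := 1; i <= 3; i++ {
		fmt.Println(pagerTarget(i))
	}
}
```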
And then, I do my scraping:
doc, err := goquery.NewDocumentFromResponse(res)
if err != nil {
	fmt.Println("ok2")
	log.Fatal(err)
}
doc.Find(".resultstable tbody tr").Each(func(i int, s *goquery.Selection) {
	companyID, ok := s.Find("td > a").Attr("name")
	if !ok {
		fmt.Println("yolo")
	}
	fmt.Println(companyID)
	scrapIt(companyID)
	time.Sleep(time.Second / 2)
})
The only fields I didn't try to pass in the form are these:
So here I am, lost and clueless. If anyone has an idea, I would be grateful!