I'm trying to grab all the URLs from a Google search results page. There are two ways I think I could do it, but I don't really know how to do either one.
First, I could simply scrape them from the .r
tags and get the href
attribute for each link. However, this gives me a really long string that I would have to parse through to get the URL. Here's an example of what I would have to parse through:
The URL I would want out of this would be:
So I would have to extract the substring between the https
and &sa
markers, which I'm not 100% sure how to do, because each long string Google gives me is a different length, so slicing off a fixed number of characters wouldn't work.
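One way around the fixed-length slicing problem might be to let net/url do the parsing instead of cutting the string by hand. This sketch assumes each href is a Google redirect of the form /url?q=<target>&sa=... (the example href below is made up to match that shape):

```go
package main

import (
	"fmt"
	"net/url"
)

// extractTarget pulls the real destination out of a redirect link of
// the form /url?q=<target>&sa=..., so the overall length of the link
// doesn't matter. Returns "" if there is no q parameter.
func extractTarget(href string) string {
	u, err := url.Parse(href)
	if err != nil {
		return ""
	}
	// Query() parses the part after "?" into key/value pairs;
	// "q" holds the decoded destination URL.
	return u.Query().Get("q")
}

func main() {
	// Hypothetical example href of the assumed redirect shape.
	fmt.Println(extractTarget("/url?q=https://example.com/page&sa=U"))
	// → https://example.com/page
}
```

This sidesteps the variable-length problem entirely, since the query parser finds the boundaries for you.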
Second, underneath each link in a Google search there is the URL in green text. Right-clicking that and inspecting the element gives <cite class="_Rm">,
which I don't know how to select with goquery, because looking for cite
with my small function just gives me more long strings of characters.
Here is my small function. It currently does the first option without the parsing, and gives me a long string of text that just takes me back to the search page:
func GetUrls(url string) {
	doc, err := goquery.NewDocument(url)
	if err != nil {
		panic(err)
	}
	// Select every link inside a .r result block.
	doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
		link, _ := s.Attr("href")
		link = url + link
		fmt.Printf("link is [%s]\n", link)
	})
}