dongyun7897
2015-06-04 05:02

Extracting URLs from a Google search results page

Accepted

I'm trying to grab all the URLs off of a Google search page and there are two ways I think I could do it, but I don't really have any idea how to do them.

First, I could simply scrape them from the .r tags and get the href attribute for each link. However, this gives me a really long string that I would have to parse through to get the URL. Here's an example of what would have to be parsed through:

https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA

The URL I would want out of this would be:

https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/

So I would have to extract the substring between the second https and &sa, which I'm not 100% sure how to do: each long string Google gives me is a different length, so slicing at a fixed number of characters wouldn't work.
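One way to avoid fixed-length slicing is to locate the markers themselves with the strings package. A minimal sketch (the extractBetween helper is hypothetical, written for this example):

```go
package main

import (
	"fmt"
	"strings"
)

// extractBetween returns the substring of s that begins at the LAST
// occurrence of start and ends just before the next occurrence of end.
// Because it searches for the markers instead of counting characters,
// it works no matter how long the surrounding string is.
func extractBetween(s, start, end string) (string, bool) {
	i := strings.LastIndex(s, start)
	if i == -1 {
		return "", false
	}
	j := strings.Index(s[i:], end)
	if j == -1 {
		return "", false
	}
	return s[i : i+j], true
}

func main() {
	raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
	if link, ok := extractBetween(raw, "https://", "&sa"); ok {
		fmt.Println(link)
	}
}
```

strings.LastIndex finds the second https:// (the embedded one), so this prints the wordpress URL; the net/url approach in the answer below is still more robust, since it handles escaping and parameter order for you.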

Second, underneath each link in a Google search there is the URL in green text. Right-clicking it and inspecting the element shows a `<cite class="_Rm">` tag, which I don't know how to find with goquery, because searching for cite with my small function just gives me more long strings of characters.

Here is my small function; it currently does the first option without the parsing and gives me a long string of text that just takes me back to the search page:

func GetUrls(url string) {

    doc, err := goquery.NewDocument(url)

    if err != nil {
        panic(err)
    }

    doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
        Link, _ := s.Attr("href")
        Link = url + Link
        fmt.Printf("link is [%s]\n", Link)
    })

}

1 answer

  • douyejv820598, 6 years ago

    The standard library has support for parsing URLs: check out the net/url package. Using this package, we can get query parameters from URLs.

    Note that your original raw URL contains the URL you want to extract in the "aqs" parameter in the form of

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    

    This is basically just another URL.

    Let's write a little helper function which gets a parameter from a raw URL text:

    // getParam parses raw as a URL and returns the value of the named
    // query parameter, or an error if it is absent.
    func getParam(raw, param string) (string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return "", err
        }

        v := u.Query().Get(param)
        if v == "" {
            return "", fmt.Errorf("param %q not found", param)
        }
        return v, nil
    }
    

    Using this we can get the "aqs" parameter from the original URL, and using it again we can get the "q" parameter, which is exactly your desired URL:

    raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
    aqs, err := getParam(raw, "aqs")
    if err != nil {
        panic(err)
    }
    fmt.Println(aqs)
    
    result, err := getParam(aqs, "q")
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
    

    Output (try it on the Go Playground):

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    