dongyun7897 2015-06-04 05:02

Extracting URLs from a Google search results page

I'm trying to grab all the URLs off of a Google search page and there are two ways I think I could do it, but I don't really have any idea how to do them.

First, I could simply scrape them from the .r tags and get the href attribute for each link. However, this gives me a really long string that I would have to parse through to get the URL. Here's an example of what would have to be parsed through:

https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA

The URL I would want out of this would be:

https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/

So I would have to extract the substring between the https and the &sa, which I'm not 100% sure how to do: each long string Google gives me is a different length, so slicing off a fixed number of characters wouldn't work.

Second, underneath each link in a Google search there is the URL in green text. Right-clicking that and inspecting the element gives <cite class="_Rm">, which I don't know how to find with goquery, because looking for cite with my small function just gives me more long strings of characters.

Here is my small function. It currently does the first option without the parsing and gives me a long string of text that just takes me to the search page:

func GetUrls(url string) {

    doc, err := goquery.NewDocument(url)

    if err != nil {
        panic(err)
    }

    doc.Find(".r").Each(func(i int, s *goquery.Selection) {

        doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
            Link, _ := s.Attr("href")
            Link = url + Link
            fmt.Printf("link is [%s]\n", Link)
        })

    })

}

  • Anonymous user 2015-06-04 06:27

    The standard library has support for parsing URLs: check out the net/url package. With it, we can extract query parameters from a URL.

    Note that your original raw URL contains the URL you want to extract in the "aqs" parameter in the form of

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    

    Which is basically another URL.
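To see why the wanted URL ends up inside "aqs", it may help to look at how net/url splits the query string. The sketch below uses a shortened stand-in for the string from the question (example.com and the short parameter values are placeholders, and queryParams is a hypothetical helper name). Everything after the first "?" is treated as the query and split on "&", so the second "?" stays inside the "aqs" value.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// queryParams parses raw as a URL and returns its decoded query parameters.
// The name is hypothetical, chosen for this illustration only.
func queryParams(raw string) (url.Values, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return u.Query(), nil
}

func main() {
	// Shortened stand-in for the concatenated string in the question.
	raw := "https://www.google.com/search?q=mh4u%20items&aqs=chrome.0/url?q=https://example.com/page/&sa=U&ei=abc"

	params, err := queryParams(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}

	// Map iteration order is random, so sort the keys before printing.
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %s\n", k, params.Get(k))
	}
}
```

In this stand-in, "aqs" carries the nested "/url?q=..." text, while "sa" and "ei" are parsed as parameters of the outer URL. The outer "q" is the percent-decoded search term.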

    Let's write a little helper function which gets a parameter from a raw URL text:

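    // getParam parses raw as a URL and returns the value of the query
    // parameter named param. It uses the "fmt" and "net/url" packages.
    func getParam(raw, param string) (string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return "", err
        }

        // Query never returns nil; a missing parameter simply yields "".
        v := u.Query().Get(param)
        if v == "" {
            return "", fmt.Errorf("param %q not found", param)
        }
        return v, nil
    }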
    

    Using this we can get the "aqs" parameter from the original URL, and using this again we can get the "q" parameter which is exactly your desired URL:

    raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
    aqs, err := getParam(raw, "aqs")
    if err != nil {
        panic(err)
    }
    fmt.Println(aqs)
    
    result, err := getParam(aqs, "q")
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
    

    Output:

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
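The two-step lookup is only needed because the question's code prepends the search page URL to each href. Assuming the href attribute on its own is the relative redirect path that follows the search URL in the long string (that is, "/url?q=...&sa=U&..."), a single lookup of "q" on the href should be enough. This is a minimal sketch under that assumption. targetFromHref is a hypothetical helper name, and the markup Google serves may differ from what the question shows.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// targetFromHref extracts the destination URL from a redirect-style href
// such as "/url?q=https://example.com/&sa=U". The helper name and the
// assumed href shape are illustrative, not taken from any Google API.
func targetFromHref(href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	target := u.Query().Get("q")
	if target == "" {
		return "", errors.New("no q parameter in href")
	}
	return target, nil
}

func main() {
	// The part of the question's long string that follows the search URL.
	href := "/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"

	target, err := targetFromHref(href)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(target)
}
```

Inside the goquery callback, this would mean passing the value returned by s.Attr("href") straight to the helper instead of concatenating it with the page URL first.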
    