dongyun7897 2015-06-04 05:02

Extracting URLs from a Google search results page

I'm trying to grab all the URLs off of a Google search page and there are two ways I think I could do it, but I don't really have any idea how to do them.

First, I could simply scrape them from the elements matching .r and read the href attribute of each link. However, this gives me a really long string that I would have to parse to get the URL. Here's an example of what would have to be parsed:

https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA

The URL I would want out of this would be:

https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/

So I would have to extract the substring between the https that follows url?q= and &sa, which I'm not 100% sure how to do: each long string Google gives me has a different length, so slicing off a fixed number of characters wouldn't work.

Second, underneath each link in a Google search result the URL is shown in green text. Right-clicking it and inspecting the element shows <cite class="_Rm">, which I don't know how to find with goquery, because looking for cite with my small function just gives me more long strings of characters.

Here is my small function. It currently does the first option without the parsing and gives me a long string that just takes me to the search page:

func GetUrls(url string) {
    doc, err := goquery.NewDocument(url)
    if err != nil {
        panic(err)
    }

    // Visit every link inside an element with class "r".
    doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
        link, ok := s.Attr("href")
        if !ok {
            return // anchor without an href attribute
        }
        // Prefixing the search URL is what produces the long string shown above.
        link = url + link
        fmt.Printf("link is [%s]\n", link)
    })
}

1 answer

  • 普通网友 2015-06-04 06:27

    The standard library has support for parsing URLs: see the net/url package. With it we can read query parameters out of a URL.

    Note that your original raw URL contains the URL you want to extract in the "aqs" parameter in the form of

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    

    This value is itself basically another URL.

    Let's write a little helper function which gets a parameter from a raw URL text:

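    // getParam returns the value of the query parameter param in the raw URL.
    // Needs the "fmt" and "net/url" imports.
    func getParam(raw, param string) (string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return "", err
        }

        // Query never returns nil, so no nil check is needed; Get returns ""
        // when the parameter is missing or empty.
        v := u.Query().Get(param)
        if v == "" {
            return "", fmt.Errorf("param %q not found", param)
        }
        return v, nil
    }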
    

    Using this we can get the "aqs" parameter from the original URL, and using this again we can get the "q" parameter which is exactly your desired URL:

    raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
    aqs, err := getParam(raw, "aqs")
    if err != nil {
        panic(err)
    }
    fmt.Println(aqs)
    
    result, err := getParam(aqs, "q")
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
    

    Output:

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
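The long string in the question is the search URL with the link's href appended, so it may be simpler to parse the href on its own before any prefix is added. The following is only a sketch under that assumption: the href shape ("/url?q=<target>&sa=...") is inferred from the example string, not from any documented Google format, and targetFromHref is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
)

// targetFromHref returns the "q" query parameter of a result link.
// Assumption: the href looks like "/url?q=<target>&sa=...", as inferred
// from the long string in the question.
func targetFromHref(href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// Get returns "" when the parameter is missing or empty.
	target := u.Query().Get("q")
	if target == "" {
		return "", fmt.Errorf("no q parameter in %q", href)
	}
	return target, nil
}

func main() {
	// Hypothetical href: the tail of the example string from the question.
	href := "/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
	target, err := targetFromHref(href)
	if err != nil {
		panic(err)
	}
	fmt.Println(target)
}
```

In the goquery loop this would mean passing the value of s.Attr("href") to the helper directly instead of concatenating it with the search URL, which avoids the two-step lookup through "aqs".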
    