dongyun7897
2015-06-04 05:02

Extracting URLs from a Google search results page

Accepted

I'm trying to grab all the URLs off of a Google search page and there are two ways I think I could do it, but I don't really have any idea how to do them.

First, I could simply scrape them from the .r tags and get the href attribute for each link. However, this gives me a really long string that I would have to parse through to get the URL. Here's an example of what would have to be parsed through:

https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA

The URL I would want out of this would be:

https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/

So I would have to extract the substring between the second https and &sa, which I'm not 100% sure how to do: each long string Google gives me is a different length, so slicing at a fixed number of characters wouldn't work.
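One way to avoid fixed-length slicing is to locate the markers themselves with the strings package. A minimal sketch (the extractBetween helper is hypothetical, written for this example):

```go
package main

import (
	"fmt"
	"strings"
)

// extractBetween returns the substring of s that begins at the LAST
// occurrence of start and ends just before the next occurrence of end.
// Because it searches for the markers instead of counting characters,
// it works no matter how long the surrounding string is.
func extractBetween(s, start, end string) (string, bool) {
	i := strings.LastIndex(s, start)
	if i == -1 {
		return "", false
	}
	j := strings.Index(s[i:], end)
	if j == -1 {
		return "", false
	}
	return s[i : i+j], true
}

func main() {
	raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
	if link, ok := extractBetween(raw, "https://", "&sa"); ok {
		fmt.Println(link)
	}
}
```

strings.LastIndex finds the second https:// (the embedded one), so this prints the wordpress URL; the net/url approach in the answer below is still more robust, since it handles escaping and parameter order for you.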

Second, underneath each link in a Google search there is the URL in green text. Right-clicking it and inspecting the element shows a `<cite class="_Rm">` tag, which I don't know how to find with goquery, because searching for cite with my small function just gives me more long strings of characters.

Here is my small function; it currently does the first option without the parsing and gives me a long string of text that just takes me back to the search page:

func GetUrls(url string) {

    doc, err := goquery.NewDocument(url)

    if err != nil {
        panic(err)
    }

    doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
        Link, _ := s.Attr("href")
        Link = url + Link
        fmt.Printf("link is [%s]\n", Link)
    })

}

1 answer

  • douyejv820598, 6 years ago

    The standard library has support for parsing URLs: check out the net/url package. Using this package, we can get query parameters from URLs.

    Note that your original raw URL contains the URL you want to extract in the "aqs" parameter in the form of

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    

    This is basically just another URL.

    Let's write a little helper function which gets a parameter from a raw URL text:

    // getParam parses raw as a URL and returns the value of the named
    // query parameter, or an error if it is absent.
    func getParam(raw, param string) (string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return "", err
        }

        v := u.Query().Get(param)
        if v == "" {
            return "", fmt.Errorf("param %q not found", param)
        }
        return v, nil
    }
    

    Using this we can get the "aqs" parameter from the original URL, and using it again we can get the "q" parameter, which is exactly your desired URL:

    raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
    aqs, err := getParam(raw, "aqs")
    if err != nil {
        panic(err)
    }
    fmt.Println(aqs)
    
    result, err := getParam(aqs, "q")
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
    

    Output (try it on the Go Playground):

    chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
    