Incorrect URL resolution with url.ResolveReference() when http:// is missing

I've built a web crawler that collects all links on a page, follows them, and searches those pages for more links until the whole site is crawled. It worked perfectly until I came across one particular site.

The problem is with their linking:

Normal case 1: absolute URL like 'http://www.example.com/test'

Normal case 2: relative path like '/test'

Problematic new case: absolute URL without the http:// prefix, just 'www.example.com'

Example code that shows the problem:

package main

import (
    "fmt"
    "log"
    "net/url"
)

func main() {
    // Case 1: the link carries a scheme, so it is an absolute URL.
    u, err := url.Parse("http://www.example.com")
    if err != nil {
        log.Fatal(err)
    }
    base, err := url.Parse("http://example.com/directory/")
    if err != nil {
        log.Fatal(err)
    }

    // Case 2: same host, but the scheme is missing.
    u2, err := url.Parse("www.example.com")
    if err != nil {
        log.Fatal(err)
    }
    base2, err := url.Parse("http://example.com/directory/")
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(base.ResolveReference(u))
    fmt.Println(base2.ResolveReference(u2))
}

Output:

http://www.example.com
http://example.com/directory/www.example.com

As you can see, the second line is wrong: without the http://, u2.IsAbs() returns false, so ResolveReference() treats 'www.example.com' as a relative path and appends it to the base directory.
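To make the cause visible, here is a minimal sketch that prints how net/url parses each of the three link forms. The helper name describe is hypothetical and only used for illustration; url.Parse, URL.IsAbs and the Scheme, Host and Path fields are the standard library's.

```go
package main

import (
	"fmt"
	"net/url"
)

// describe is a hypothetical helper: it reports which URL fields
// url.Parse filled in for the given raw link.
func describe(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "parse error: " + err.Error()
	}
	return fmt.Sprintf("scheme=%q host=%q path=%q abs=%t",
		u.Scheme, u.Host, u.Path, u.IsAbs())
}

func main() {
	links := []string{
		"http://www.example.com", // normal case 1
		"/test",                  // normal case 2
		"www.example.com",        // problematic case
	}
	for _, raw := range links {
		fmt.Printf("%-24s %s\n", raw, describe(raw))
	}
}
```

For 'www.example.com' the host field stays empty and the whole string ends up in the path, which is why it is later resolved relative to the base.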

Any ideas on how to fix that? I have to check 100,000 to 1,000,000 links daily, maybe more, so the solution needs to be fast.


1 answer

douqianrou9079 2016-03-13 04:15
Unfortunately there's no real "fix" for this, because if you get a link with an href like this:

www.example.com

In the general case it's ambiguous between:

http://host.tld/path/to/www.example.com
http://www.example.com

In fact, most browsers treat a link like this:

<a href="www.example.com">

As this:

<a href="/current/path/www.example.com">

I'd suggest doing the same (since this is a bug with the person's website), and if you get a 404 just treat it as you would any other.
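A minimal sketch of that approach, assuming the crawler resolves many links against the same page: the base URL is parsed once and every href is resolved with ResolveReference(), exactly as a browser would. The resolver type and the newResolver and resolve names are hypothetical and not part of net/url.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolver is a hypothetical helper that holds a base URL parsed once,
// so resolving many links from the same page does not re-parse it.
type resolver struct {
	base *url.URL
}

// newResolver parses the page URL that all hrefs are resolved against.
func newResolver(pageURL string) (*resolver, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	return &resolver{base: base}, nil
}

// resolve follows browser behavior: an href without a scheme is a
// relative reference, even if it looks like a host name.
func (r *resolver) resolve(href string) (string, error) {
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", err
	}
	return r.base.ResolveReference(ref).String(), nil
}

func main() {
	r, err := newResolver("http://example.com/directory/")
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	for _, href := range []string{"http://www.example.com", "/test", "www.example.com"} {
		resolved, err := r.resolve(href)
		if err != nil {
			fmt.Println("skipping unparsable link:", href)
			continue
		}
		fmt.Println(resolved)
	}
}
```

The scheme-less link resolves to a path under the current directory; if fetching it returns a 404, the crawler records it like any other broken link.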
This answer was accepted by the asker as the best answer.
