doujuanxun7167 2015-09-17 00:38

Resolving an absolute path from a relative path

I'm writing a web crawler and I'm trying to find a way to work out the absolute path from a relative path. I have two test sites: one built with ROR and one built with Pyro CMS.

On the latter, I found href attributes with the link "index.php". If I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that it should be appended to the root instead, i.e. the result should have been http://example.com/index.php. And when I then crawl http://example.com/xyz/index.php, I find another "index.php", which gets appended again.

On the ROR site, by contrast, the relative paths start with '/', so I can easily tell that they are relative to the site root.

I can handle the index.php case, but there might be many more rules to take care of if I start doing this manually. I'm sure there's an easier way to get this done.

1 answer

  • doupai8533 2015-09-17 06:26

    In Go, the path package (import "path") is your friend.

    You can get the directory part of a path with path.Dir(), e.g.

    p := "/xyz/index.php"
    dir := path.Dir(p)
    fmt.Println("dir:", dir) // Output: dir: /xyz
    

    If you find a link with a rooted path (one that starts with a slash), you can use it as-is.

    If it is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the resulting path:

    p2 := path.Join(dir, "index.php")
    fmt.Println("p2:", p2)
    p3 := path.Join(dir, "./index.php")
    fmt.Println("p3:", p3)
    p4 := path.Join(dir, "../index.php")
    fmt.Println("p4:", p4)
    

    Output:

    p2: /xyz/index.php
    p3: /xyz/index.php
    p4: /index.php
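
    The two cases above (rooted link, relative link) can be combined into one small function. This is only a sketch: resolvePath and its parameter names are made up for illustration, and it works on the path part only, not on full URLs. Note that path.Dir("/xyz") is "/", so a link "index.php" found on the page /xyz ends up at the root, which is the case from the question.

    ```go
    package main

    import (
        "fmt"
        "path"
        "strings"
    )

    // resolvePath is a hypothetical helper: it resolves href against the
    // path of the page on which the link was found (pagePath).
    func resolvePath(pagePath, href string) string {
        if strings.HasPrefix(href, "/") {
            // Rooted link: use it as-is.
            return href
        }
        // Relative link: join it with the directory of the current page.
        // path.Join also cleans the result ("." and ".." are resolved).
        return path.Join(path.Dir(pagePath), href)
    }

    func main() {
        fmt.Println(resolvePath("/xyz", "index.php"))              // /index.php
        fmt.Println(resolvePath("/xyz/index.php", "index.php"))    // /xyz/index.php
        fmt.Println(resolvePath("/xyz/index.php", "../index.php")) // /index.php
        fmt.Println(resolvePath("/xyz/index.php", "/about"))       // /about
    }
    ```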
    

    The "cleaning" tasks performed by path.Join() are done by path.Clean() which you can manually call on any path of course. They are:

    1. Replace multiple slashes with a single slash.
    2. Eliminate each . path name element (the current directory).
    3. Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
    4. Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.
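
    To see the four rules in action, here is a minimal sketch that calls path.Clean() with one input per rule. The helper name cleanExamples and the sample paths are made up for illustration.

    ```go
    package main

    import (
        "fmt"
        "path"
    )

    // cleanExamples returns path.Clean applied to one sample input per rule.
    func cleanExamples() []string {
        return []string{
            path.Clean("/xyz//index.php"),   // rule 1: multiple slashes
            path.Clean("/xyz/./index.php"),  // rule 2: "." element
            path.Clean("/xyz/../index.php"), // rule 3: inner ".." element
            path.Clean("/../index.php"),     // rule 4: ".." at the root
        }
    }

    func main() {
        for i, p := range cleanExamples() {
            fmt.Printf("rule %d: %s\n", i+1, p)
        }
        // rule 1: /xyz/index.php
        // rule 2: /xyz/index.php
        // rule 3: /index.php
        // rule 4: /index.php
    }
    ```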

    And if you have a "full" URL (with scheme, host, etc.), you can use the url.Parse() function from the net/url package to obtain a url.URL value from the raw URL string. It tokenizes the URL for you, so you can get the path like this:

    uraw := "http://example.com/xyz/index.php"
    u, err := url.Parse(uraw)
    if err != nil {
        fmt.Println("Invalid url:", err)
        return // u is not usable if parsing failed
    }
    fmt.Println("Path:", u.Path)
    

    Output:

    Path: /xyz/index.php
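
    For a crawler that works with full URLs, the net/url package also has the URL.ResolveReference() method, which resolves a link against a base URL according to RFC 3986. It is not part of the approach described above, so treat the following as an alternative sketch; absoluteURL is a made-up helper name.

    ```go
    package main

    import (
        "fmt"
        "net/url"
    )

    // absoluteURL is a hypothetical helper: it resolves href against the
    // URL of the page on which the link was found (pageURL).
    func absoluteURL(pageURL, href string) (string, error) {
        base, err := url.Parse(pageURL)
        if err != nil {
            return "", err
        }
        ref, err := url.Parse(href)
        if err != nil {
            return "", err
        }
        // ResolveReference handles rooted, relative and absolute links.
        return base.ResolveReference(ref).String(), nil
    }

    func main() {
        pages := []string{"http://example.com/xyz", "http://example.com/xyz/index.php"}
        for _, page := range pages {
            abs, err := absoluteURL(page, "index.php")
            if err != nil {
                fmt.Println("Invalid url:", err)
                continue
            }
            fmt.Println(page, "->", abs)
        }
        // http://example.com/xyz -> http://example.com/index.php
        // http://example.com/xyz/index.php -> http://example.com/xyz/index.php
    }
    ```

    With this, the page http://example.com/xyz and the link "index.php" give http://example.com/index.php, which is the result the question asks for.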
    

    Try all the examples on the Go Playground.
