I've built a web crawler that collects all links on a page, then follows those links and searches them for more links until the whole site is crawled. It worked perfectly until I came across a particular site.
Problem with their linking:
Normal case 1: absolute path like 'http://www.example.com/test'
Normal case 2: relative path like '/test'
Problematic new case: an absolute URL without the http:// scheme - just 'www.example.com'
Example code that shows the problem:
package main

import (
	"fmt"
	"log"
	"net/url"
)

func main() {
	// Case 1: a proper absolute URL resolves correctly.
	u, err := url.Parse("http://www.example.com")
	if err != nil {
		log.Fatal(err)
	}
	base, err := url.Parse("http://example.com/directory/")
	if err != nil {
		log.Fatal(err)
	}

	// Case 2: a scheme-less host is parsed as a relative path.
	u2, err := url.Parse("www.example.com")
	if err != nil {
		log.Fatal(err)
	}
	base2, err := url.Parse("http://example.com/directory/")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(base.ResolveReference(u))
	fmt.Println(base2.ResolveReference(u2))
}
Output:

http://www.example.com
http://example.com/directory/www.example.com
As you can see, the second line gives back a wrong URL: because the http:// is missing, u.IsAbs() returns false and the reference is resolved as a relative path against the base.
Any ideas how to fix that? I have to test 100,000 to 1,000,000 links per day, maybe more, and it needs to be fast.