I'm writing a web crawler and I'm trying to figure out how to get an absolute URL from a relative path. I took two test sites: one in ROR and one made using Pyro CMS.
In the latter, I found href attributes with the link "index.php". So if I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that I should be appending to the root instead, i.e. it should have been http://example.com/index.php. And if I then crawl http://example.com/xyz/index.php, I'll find another "index.php", which gets appended again.
In the ROR site, on the other hand, when a relative path starts with '/', I can easily tell that it is relative to the site root.
I could special-case index.php, but there may be many more rules to take care of if I start doing this manually. I'm sure there's an easier way to get this done.
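To illustrate the behavior being asked for, here is a minimal sketch that assumes the crawler is (or could be) written in Python; the question does not say which language is used. It relies on the standard library's `urllib.parse.urljoin`, which follows the generic relative-reference resolution rules of RFC 3986. The helper name `resolve` and the sample URLs other than those in the question are hypothetical.

```python
from urllib.parse import urljoin


def resolve(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL.

    urljoin replaces the last path segment of page_url with href,
    unless page_url ends in '/', in which case href is appended.
    An href starting with '/' is resolved against the site root.
    """
    return urljoin(page_url, href)


# Page URL without a trailing slash: "xyz" is treated as a file name,
# so "index.php" replaces it instead of being appended.
print(resolve("http://example.com/xyz", "index.php"))
# http://example.com/index.php

# Crawling the resulting page again does not stack another "index.php".
print(resolve("http://example.com/xyz/index.php", "index.php"))
# http://example.com/xyz/index.php

# Page URL with a trailing slash: "xyz/" is a directory, so href is appended.
print(resolve("http://example.com/xyz/", "index.php"))
# http://example.com/xyz/index.php

# Root-relative href (leading '/'), as on the ROR site.
print(resolve("http://example.com/xyz/abc", "/about"))
# http://example.com/about
```

The point of the sketch is that no per-site rules are needed: the trailing slash of the page URL, not the CMS, decides whether "index.php" is appended or replaces the last segment. A page that declares a `<base href>` would need that value used as the base instead of the page URL, which this sketch does not handle.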