dongzhent208577 2012-12-17 21:17
121 views
Accepted

How to get all the webpages on a domain

I am making a simple web spider, and I was wondering if there is a way, triggered from my PHP code, to get all the webpages on a domain...

e.g. Let's say I wanted to get all the webpages on Stackoverflow.com. That means that it would get, for example:

https://stackoverflow.com/questions/ask
"pulling webpages from an adult site -- how to get past the site agreement?"
https://stackoverflow.com/questions/1234214/ ("Best Rails HTML Parser")

And all the links on those pages. How can I get that? Or is there an API or directory that can enable me to get it?

Also is there a way I can get all the subdomains?

Btw, how do crawlers crawl websites that don't have sitemaps or syndication feeds?

Cheers.


5 answers

  • dongse3348 2012-12-17 21:21

    If a site wants you to be able to do this, they will probably provide a Sitemap. Using a combination of a sitemap and following the links on pages, you should be able to traverse all the pages on a site - but this is really up to the owner of the site, and how accessible they make it.
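
    As a rough illustration of the sitemap half of that, here is a minimal PHP sketch that fetches a sitemap and collects the URLs it lists. The sitemap location and domain are assumptions (the sitemaps.org convention is /sitemap.xml, though a site may announce another location in its robots.txt), and it assumes allow_url_fopen is enabled:

        <?php
        // Minimal sketch: read a sitemap and collect the page URLs it lists.
        // Assumes the site publishes one at /sitemap.xml and that allow_url_fopen is on.
        $sitemapUrl = 'https://example.com/sitemap.xml';   // hypothetical location

        $xml = @simplexml_load_file($sitemapUrl);          // fetch and parse in one step
        if ($xml === false) {
            die("No sitemap found, or it could not be parsed.\n");
        }

        $urls = [];
        foreach ($xml->url as $entry) {                    // each <url> element holds a <loc>
            $urls[] = (string) $entry->loc;
        }

        print_r($urls);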

    If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.
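
    For sites that don't publish a sitemap (the situation asked about above), the only practical option is the other half of that approach: fetch a page, pull out its links, and keep following the ones that stay on the same host. Below is a minimal, depth-limited sketch of that idea; the start URL and depth limit are illustrative, relative links other than root-relative ones are skipped for brevity, and a real spider should also honour robots.txt and rate-limit itself:

        <?php
        // Minimal sketch: depth-limited recursive crawl that follows same-host links.
        // The start URL is hypothetical; assumes allow_url_fopen is enabled.
        function crawl(string $url, string $host, array &$seen, int $depth = 2): void
        {
            if ($depth === 0 || isset($seen[$url])) {
                return;                               // stop at the depth limit or on repeats
            }
            $seen[$url] = true;

            $html = @file_get_contents($url);         // fetch the page body
            if ($html === false) {
                return;
            }

            $doc = new DOMDocument();
            @$doc->loadHTML($html);                   // tolerate messy real-world markup

            foreach ($doc->getElementsByTagName('a') as $a) {
                $href = $a->getAttribute('href');

                // Turn root-relative links ("/about") into absolute ones.
                if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
                    $href = 'https://' . $host . $href;
                }
                // Only follow links that stay on the same host.
                if (parse_url($href, PHP_URL_HOST) !== $host) {
                    continue;
                }
                crawl($href, $host, $seen, $depth - 1);
            }
        }

        $seen = [];
        crawl('https://example.com/', 'example.com', $seen);   // hypothetical start page
        print_r(array_keys($seen));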

    This answer was accepted by the asker.