dongzhent208577 2012-12-17 21:17

How to get all the webpages on a domain

I am building a simple web spider in PHP, and I was wondering whether there is a way to get all the webpages on a domain from my code.

For example, say I wanted to get all the webpages on Stackoverflow.com. That would include pages such as:

  • https://stackoverflow.com/questions/ask
  • pulling webpages from an adult site -- how to get past the site agreement?
  • https://stackoverflow.com/questions/1234214/
  • Best Rails HTML Parser

and all the links on them. How can I get that? Or is there an API or directory that would let me get it?

Also, is there a way to get all the subdomains?

By the way, how do crawlers crawl websites that don't have sitemaps or syndication feeds?

Cheers.


5 answers

  • dongse3348 2012-12-17 21:21

    If a site wants you to be able to do this, it will probably provide a sitemap. By combining the sitemap with following the links on each page, you should be able to traverse all the pages on the site, but this is really up to the site owner and how accessible they make it.
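
For the sitemap half of this, a minimal PHP sketch might look like the following. It assumes the sitemap follows the sitemaps.org XML format, where each page URL sits in a `<loc>` element. The helper name `sitemapUrls` and the sample document are made up for illustration. A real sitemap would have to be downloaded first, often from `/sitemap.xml`, though that location is only a convention.

```php
<?php
// Hypothetical helper: pull every <loc> URL out of a sitemap (or sitemap index).
// Takes the XML as a string so the sketch can be tried without network access.
function sitemapUrls(string $xml): array
{
    if ($xml === '') {
        return []; // loadXML() rejects an empty string on PHP 8
    }

    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        return []; // not well-formed XML
    }

    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $url = trim($loc->textContent);
        if ($url !== '') {
            $urls[] = $url;
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

// Made-up sample sitemap for illustration only.
$sample = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
XML;

print_r(sitemapUrls($sample));
```

If the site publishes a sitemap index instead, the same function returns the URLs of the child sitemaps, which would then need to be fetched and parsed in turn.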

    If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.
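
Since there is no listing mechanism, a crawler can only discover pages that some other page links to. Below is a rough sketch of that link-following approach, under several assumptions: the function names `sameHostLinks` and `crawl` are hypothetical, only absolute and root-relative links are handled, and pages are served from an in-memory array so the example runs offline. In real use the `$fetch` callback could wrap `file_get_contents` or cURL.

```php
<?php
// Hypothetical helper: return absolute http(s) links in $html that stay on the
// same host as $pageUrl. Only absolute URLs and root-relative paths ("/foo")
// are handled; other relative forms are skipped in this sketch.
function sameHostLinks(string $html, string $pageUrl): array
{
    $base = parse_url($pageUrl);
    if ($html === '' || !isset($base['scheme'], $base['host'])) {
        return [];
    }
    $origin = $base['scheme'] . '://' . $base['host'];

    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $href = $origin . $href; // resolve root-relative path
        }
        $parts = parse_url($href);
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            continue; // mailto:, javascript:, "#top", "page.html", ...
        }
        if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)
            || strcasecmp($parts['host'], $base['host']) !== 0) {
            continue; // different scheme or different host
        }
        $links[explode('#', $href, 2)[0]] = true; // drop fragment, dedupe
    }
    return array_keys($links);
}

// Breadth-first crawl from $startUrl. $fetch takes a URL and returns the
// HTML as a string, or a non-string (null/false) on failure.
function crawl(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $queue = [$startUrl];
    $seen = [$startUrl => true];
    $visited = [];
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $html = $fetch($url);
        if (!is_string($html)) {
            continue;
        }
        $visited[] = $url;
        foreach (sameHostLinks($html, $url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Made-up in-memory "site" so the sketch runs without network access.
$pages = [
    'https://example.com/'       => '<a href="/a">A</a> <a href="https://other.example.org/">ext</a>',
    'https://example.com/a'      => '<a href="/b#top">B</a> <a href="/">home</a>',
    'https://example.com/b'      => '<p>no links</p>',
    'https://example.com/orphan' => '<p>nothing links here</p>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? null;
};

print_r(crawl('https://example.com/', $fetch));
```

The crawl reaches `/`, `/a` and `/b`, but never `/orphan`, because no page links to it. That is the limitation described above: a page that is neither linked nor listed in a sitemap cannot be found this way.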

    Selected by the asker as the best answer.