du512926 2012-07-07 21:27

How can I find all URLs with PHP?

Is there any code or library for getting all the URLs under a domain?

For example, if my domain is https://stackoverflow.com/, I'd like to find all the question URLs, like these:

  1. Java lib or app to convert CSV to XML file?
  2. https://stackoverflow.com/questions/456/what-can-i
  3. https://stackoverflow.com/questions/789/where-can-i

I don't know how many questions there are under the domain, but I have to build an engine that finds all the URLs and then inserts the content of each page into my database.

I will create a small search engine for my 5 web pages.

Can anyone help please?

Thanks,


1 answer

  • doufei6456 2012-07-07 21:38

    Lucene search allows you to easily index your pages so they can be searched efficiently and accurately.

    See Zend_Search_Lucene for a PHP implementation of Lucene search.

    You still have to spider your site and build the index, which is a separate problem. You could use a tool such as Teleport Pro to spider your site and give you a list of URLs, then feed that list to a PHP script that fetches the content of each page and passes it to Zend_Search_Lucene to build the index. You can also write the crawler in PHP or use an existing solution. A search for "php crawler" yields many results, including this useful PHP crawler.
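
If you go the route of writing the crawler in PHP, the core of it could look roughly like the sketch below. It is only an illustration, not a finished crawler: `extractSameHostLinks` and `crawlSite` are hypothetical names made up for this example, and the code assumes PHP 7 or later with the standard DOM extension. It follows absolute and root-relative links only, and it ignores robots.txt, redirects, and rate limiting.

```php
<?php
// Hypothetical helpers for illustration; only DOMDocument, parse_url and
// libxml_use_internal_errors are PHP built-ins.

/** Return the unique absolute http(s) URLs on the same host linked from $html. */
function extractSameHostLinks(string $html, string $pageUrl): array
{
    $base = parse_url($pageUrl);
    if ($html === '' || !is_array($base) || empty($base['scheme']) || empty($base['host'])) {
        return [];
    }
    $origin = $base['scheme'] . '://' . $base['host'];

    // Real-world HTML is rarely valid, so collect parser errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty link or in-page anchor
        }
        if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
            $href = $origin . $href; // root-relative link such as /questions/456
        }
        $parts = parse_url($href);
        // Skips mailto:, javascript:, protocol-relative and plain relative links.
        if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
            continue;
        }
        if (!in_array($parts['scheme'], ['http', 'https'], true) || $parts['host'] !== $base['host']) {
            continue;
        }
        $links[explode('#', $href, 2)[0]] = true; // drop the fragment, de-duplicate
    }
    return array_keys($links);
}

/** Breadth-first crawl from $startUrl; returns [url => html] for every page fetched. */
function crawlSite(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $queue = [$startUrl];
    $seen  = [$startUrl => true];
    $pages = [];
    while ($queue && count($pages) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // fetch failed
        }
        $pages[$url] = $html;
        foreach (extractSameHostLinks($html, $url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    return $pages;
}
```

To run it against a live site you would pass a fetcher such as `function ($url) { return @file_get_contents($url); }` as the second argument of `crawlSite`. The returned array of URL to HTML pairs is what you would then store in your database or hand to the indexer. Passing the fetcher in as a callable also makes it easy to swap in cURL or to try the crawler on canned pages first.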
