dousao6313 2014-04-10 12:12
49 views

PHP link crawler: disable checking of the page URLs inside external links

I have created a standalone link-crawler script to find broken links on my site, based on this example: http://phpcrawl.cuab.de/example.html.

Crawling the links works fine, but the crawler also follows external links and checks the page URLs inside their content. That is not what I need. It should check internal links, the page URLs inside internal pages, and the external links themselves, but it should not follow an external link and check the URLs (or image src attributes) inside that external page. In other words, it should only verify whether an external link is broken, without crawling that link's content.


2 answers

  • doupingyun73833 2014-04-10 12:18

    If you read the documentation for the framework you are using, you will find the addURLFollowRule() method, which restricts the crawler to following only URLs that match specific patterns.

    Add this to your code, with the correct regex pattern for your internal URL(s):

    $crawler->addURLFollowRule("#https?://internal/.*# i");
    

    Documentation: http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_addURLFollowRule.htm
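    To illustrate how such a follow rule works, here is a minimal self-contained sketch of the pattern-matching itself, using preg_match() with the same PREG-style pattern string that addURLFollowRule() expects. The host example.com is a placeholder assumption standing in for your actual internal domain:

    ```php
    <?php
    // Placeholder pattern for the site's own (internal) host; phpcrawl's
    // addURLFollowRule() accepts PREG-style patterns like this one.
    $followPattern = "#^https?://(www\.)?example\.com/.*#i";

    // A URL matching the pattern would be followed (its content crawled);
    // a non-matching (external) URL would only be requested, not crawled into.
    $internal = "https://www.example.com/about.html";
    $external = "https://othersite.org/page.html";

    var_dump(preg_match($followPattern, $internal)); // int(1) — followed
    var_dump(preg_match($followPattern, $external)); // int(0) — not followed
    ```

    With a rule like this in place, external pages are still requested once (so a broken external link is still detected via its HTTP status), but none of the links or images found inside them are queued for checking.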


