douyi1084 2014-06-06 10:31

Find links on a page that are on the same domain as the page

I am building a crawler that starts from a page of a website, say example.com, and finds all the links on that page that are on the same domain.

Suppose the page contains links to example.com/hello.php and facebook.com/hello.php. I want to list only www.example.com/hello.php.

I am using PHP Simple HTML DOM Parser (simplehtmldom.sourceforge.net/).

$html = file_get_html('http://www.example.com/');
// Find all links 
foreach($html->find('a') as $element) {
    $uri = $element->href;
    // Now, how do I check whether $uri belongs to the same domain?
}

1 answer

  • doutongwei4380 2014-06-06 10:52

    Assuming all your URLs are already absolute* URLs, as in http://example.com/hello.php, you can use parse_url to get the host of each URL.

    php > $url = "http://example.com/hello.php";
    php > print parse_url($url, PHP_URL_HOST);
    example.com
    

    You now just have to compare the host of each link to the host of the site you're currently crawling. If comparing hosts is not enough (for example, because www.example.com and example.com are different hosts), you have to extract the registrable domain from each host. This is not easy, as there is no general rule for it. https://www.publicsuffix.org/ has all the information you'll need for this task, though, including a PHP URL parser library.
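The host comparison described above could look roughly like the following. This is only a sketch: the function name isSameHost and the base URL are made up for illustration, and it assumes both URLs are absolute and that PHP 7 or later is available (for the type declarations).

```php
<?php
// Hypothetical helper (not part of the original answer): true when $uri
// has the same host as $baseUrl. Both are assumed to be absolute URLs.
function isSameHost(string $uri, string $baseUrl): bool
{
    $linkHost = parse_url($uri, PHP_URL_HOST);
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);

    // parse_url() returns null when the URL has no host and false when
    // it is seriously malformed; treat both as "not the same host".
    if (!is_string($linkHost) || !is_string($baseHost)) {
        return false;
    }

    // Host names are case-insensitive, so compare without regard to case.
    return strcasecmp($linkHost, $baseHost) === 0;
}

$base = 'http://www.example.com/';
var_dump(isSameHost('http://www.example.com/hello.php', $base)); // bool(true)
var_dump(isSameHost('http://facebook.com/hello.php', $base));    // bool(false)
var_dump(isSameHost('http://example.com/hello.php', $base));     // bool(false)
```

The last call shows why a plain host comparison may not be enough: example.com and www.example.com count as different hosts, which is where the Public Suffix List comes in.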

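    *(Relative URLs, such as /hello.php or hello.php, are of course on the same domain, so you don't need a host check for them. Protocol-relative URLs such as //facebook.com/hello.php do carry a host and still need one.)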
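Links that are not absolute can be folded into the same check by treating a missing host as "same host". The sketch below assumes that behaviour is what you want; the name staysOnHost is hypothetical, and it also rejects non-web schemes such as mailto:, which the answer does not discuss. It relies on parse_url accepting scheme-less URLs with a host (PHP 5.4.7 and later).

```php
<?php
// Hypothetical helper: decide whether a raw href found on a page served
// from $baseHost points back to that same host.
function staysOnHost(string $href, string $baseHost): bool
{
    $parts = parse_url(trim($href));
    if ($parts === false) {
        return false; // seriously malformed URL
    }

    // Skip non-web schemes such as mailto: or javascript:.
    if (isset($parts['scheme'])
        && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // No host component: a relative link, so it stays on the current host.
    if (!isset($parts['host'])) {
        return true;
    }

    // Absolute or protocol-relative link: compare hosts case-insensitively.
    return strcasecmp($parts['host'], $baseHost) === 0;
}

var_dump(staysOnHost('/hello.php', 'www.example.com'));                // bool(true)
var_dump(staysOnHost('//facebook.com/hello.php', 'www.example.com')); // bool(false)
var_dump(staysOnHost('mailto:someone@example.com', 'www.example.com')); // bool(false)
```

Inside the question's foreach loop this would be called as staysOnHost($element->href, 'www.example.com').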

    This answer was accepted by the asker as the best answer.
