dotj6816 2011-04-15 15:14

Recursion problem

I'm grabbing links from a website, but the higher I set the recursion depth of my function, the stranger the results become.

For example, when I call the function like this:

crawl_page("http://www.mangastream.com/", 10);

I get results like this for about half the page:

http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2

EDIT

whereas I'm expecting results like this instead:

http://mangastream.com/manga/read/naruto/51619850/1

Here's the function I've been using to get the results:

function crawl_page($url, $depth)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        if (shouldScrape($href) == true) {
            crawl_page($href, $depth - 1);
        }
    }
    echo $url,"";
//,pageStatus($url)
}

Any help with this would be greatly appreciated.


2 answers

  • dseax40600 2011-04-15 15:59

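    The way you build the new URL is incorrect: it appends every relative href to the full URL of the current page, so the paths pile up with each level of recursion. Replace: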

    $href = rtrim($url, '/') . '/' . ltrim($href, '/');
    

    with:

    if (substr($href, 0, 1) == '/') {
      // href is relative to the site root: prepend scheme and host
      $info = parse_url($url);
      $href = $info['scheme'] . '://' . $info['host'] . $href;
    } else {
      // href is relative to the current path
      $href = rtrim(dirname($url), '/') . '/' . $href;
    }
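
To make the difference concrete, here is a minimal sketch that runs the original line and a root-aware resolver side by side on the Naruto URL from the question. The function names `naive_join` and `resolve_href` are hypothetical and do not appear in the original post. For hrefs relative to the current path, the sketch takes the directory from the parsed URL path rather than calling `dirname()` on the whole URL; that is a substitution of mine, made so that a page URL without a path component is still handled. It does not cover `../` segments, query strings, or ports.

```php
<?php
// Hypothetical helpers for illustration; neither name is from the original post.

// The original line from the question: glue the href onto the full page URL.
function naive_join(string $url, string $href): string
{
    return rtrim($url, '/') . '/' . ltrim($href, '/');
}

// Root-aware resolution, following the idea in the answer above.
function resolve_href(string $url, string $href): string
{
    if (strpos($href, 'http') === 0) {
        return $href; // already absolute (same check as in the question)
    }
    $info = parse_url($url);
    $root = $info['scheme'] . '://' . $info['host'];
    if (substr($href, 0, 1) === '/') {
        return $root . $href; // relative to the site root
    }
    // Relative to the current path: keep everything up to the last slash.
    $path = $info['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

$page = 'http://mangastream.com/read/naruto/51619850/1';

echo naive_join($page, '/read/naruto/51619850/2'), "\n";
// http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2

echo resolve_href($page, '/read/naruto/51619850/2'), "\n";
// http://mangastream.com/read/naruto/51619850/2

echo resolve_href($page, '2'), "\n";
// http://mangastream.com/read/naruto/51619850/2
```

The first output shows the same pattern as the broken URLs in the question: each recursion level appends another copy of the path to a URL that already contains one.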
    
    Selected by the asker as the best answer.
