dotj6816 2011-04-15 15:14

Recursion problem

I'm grabbing links from a website, but the higher I set the recursion depth for the function, the stranger the results become.

For example, when I call the function like this:

crawl_page("http://www.mangastream.com/", 10);

I get results like this for about half of the page:

http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2

EDIT

I'm expecting results like this instead:

http://mangastream.com/manga/read/naruto/51619850/1

Here's the function I've been using to get the results:

function crawl_page($url, $depth)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        if (shouldScrape($href) == true) {
            crawl_page($href, $depth - 1);
        }
    }
    echo $url,"";
//,pageStatus($url)
}

Any help with this would be greatly appreciated.

2 answers

  • dseax40600 2011-04-15 15:59

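    Your new URL is constructed incorrectly: each relative href is appended to the full URL of the current page, so the path grows with every level of recursion. Replace: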

    $href = rtrim($url, '/') . '/' . ltrim($href, '/');
    

    with:

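    if (substr($href, 0, 1) == '/') {
      // href relative to root: keep only the scheme and host of the current URL
      $info = parse_url($url);
      $href = $info['scheme'] . '://' . $info['host'] . $href;
    } else {
      // href relative to current path
      // note: dirname() treats the URL as a filesystem path, so this assumes
      // $url has a path component after the host
      $href = rtrim(dirname($url), '/') . '/' . $href;
    }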
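
The same idea can be wrapped in a small helper so that the crawler resolves every href in one place. This is only a minimal sketch, assuming PHP 7 or later: `resolve_href` is a hypothetical name that does not appear in the original post, and the function ignores ports, query strings, fragments and `../` segments, which a real crawler would probably need to handle.

```php
<?php
// Hypothetical helper (not part of the original post): resolve $href
// against the URL of the page it was found on ($base).
function resolve_href(string $base, string $href): string
{
    // Already absolute (http:// or https://): use it unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $info = parse_url($base);
    $root = $info['scheme'] . '://' . $info['host'];

    // Protocol-relative href such as //host/path: reuse the base scheme.
    if (substr($href, 0, 2) === '//') {
        return $info['scheme'] . ':' . $href;
    }

    // Root-relative href such as /read/naruto/1: append to scheme and host.
    if (substr($href, 0, 1) === '/') {
        return $root . $href;
    }

    // Path-relative href: keep the base path up to and including its last
    // slash, then append the href. A base without a path is treated as "/".
    $path = $info['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $href;
}
```

With this sketch, resolving `/read/naruto/51619850/2` against `http://mangastream.com/read/naruto/51619850/1` gives `http://mangastream.com/read/naruto/51619850/2`, instead of a URL in which the path is repeated. Inside `crawl_page`, the `if (0 !== strpos($href, 'http'))` block could then be reduced to a single `$href = resolve_href($url, $href);` call.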
    
    Selected by the asker as the best answer.