duanhui3759 2015-04-04 02:41
Viewed 91 times

The PHP crawler I am using has a memory leak. What is causing it?

I am using a PHP crawler that has a memory leak. It runs fine for the first ~3125 links, then it runs out of memory. I tried getting rid of the MySQL insert, but that did not change anything. Can someone help me diagnose this problem? Thank you so much.

<?php
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
ini_set('max_execution_time', 0);
// USAGE
$startURL = $your_url;
$depth = 9999;
$crawler = new crawler($startURL, $depth);
// Exclude paths with the following structure from being processed
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();
class crawler
{
protected $_url;
protected $_depth;
protected $_host;
protected $_seen = array();
protected $_filter = array();

public function __construct($url, $depth = 5)
{
    $this->_url = $url;
    $this->_depth = $depth;
    $parse = parse_url($url);
    $this->_host = $parse['host'];
}

protected function _processAnchors($content, $url, $depth)
{
    $dom = new DOMDocument('1.0');
    @$dom->loadHTML($content); // @ suppresses warnings from malformed markup
    $anchors = $dom->getElementsByTagName('a');

    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
        }
        // Crawl only links that belong to the start domain (enforced by isValid())
        $this->crawl_page($href, $depth - 1);
    }
}

protected function _getContent($url)
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($handle);
    curl_close($handle);
    // Wrapped in an array because crawl_page() reads it with list()
    return array($response);
}

protected function _printResult($url, $depth)
{
    ob_end_flush();
    $currentDepth = $this->_depth - $depth;
    $count = count($this->_seen);
    echo "$url <br>";
    // Re-includes the config and opens a new PDO connection on every crawled page
    include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
    $databaseconnect = new PDO("mysql:dbname=DB_NAME;host=$mysqlhost;charset=utf8", $mysqlusername, $mysqlpassword);
    $statement = $databaseconnect->prepare("INSERT INTO data(url,name) VALUES(:url,:name)");
    $statement->execute(array(':url' => $url, ':name' => $url));
    ob_start();
    flush();
}

protected function isValid($url, $depth)
{
    if (strpos($url, $this->_host) === false
        || $depth === 0
        || isset($this->_seen[$url])
    ) {
        return false;
    }
    foreach ($this->_filter as $excludePath) {
        if (strpos($url, $excludePath) !== false) {
            return false;
        }
    }
    return true;
}

public function crawl_page($url, $depth)
{
    if (!$this->isValid($url, $depth)) {
        return;
    }
    // add to the seen URL
    $this->_seen[$url] = true;
    // get the page content (no return code is actually returned by _getContent())
    list($content) = $this->_getContent($url);
    // print Result for current Page
    $this->_printResult($url, $depth);
    // process subPages
    $this->_processAnchors($content, $url, $depth);
}

public function addFilterPath($path)
{
    $this->_filter[] = $path;
}

public function run()
{
    $this->crawl_page($this->_url, $this->_depth);
}
}
?>

1 answer

douzhuan1467 2015-04-04 02:59

I'm not sure if this classifies as a memory leak exactly. You are essentially using recursion without a terminating case. Before the crawl_page() method finishes it calls _processAnchors(), which in turn may call crawl_page() again if it finds any links (very likely). Every recursive call eats up more memory because the originating crawl_page() call (and most thereafter) can't be removed from the call stack until all of its recursive calls terminate.
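
To make that concrete, here is a minimal sketch of the same crawl written iteratively: an explicit queue replaces the recursion, so each page's call frame is released as soon as that page has been processed. The names below (fetchHtml(), extractLinks(), crawlIteratively(), $queue) are illustrative stand-ins for the original _getContent()/_processAnchors() logic, not functions from the question.

<?php
// Hypothetical stand-in for the original _getContent(): fetch one page with cURL
function fetchHtml($url)
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($handle);
    curl_close($handle);
    return $response === false ? '' : $response;
}

// Hypothetical stand-in for _processAnchors(): return the resolved links
// instead of recursing into crawl_page() for each one
function extractLinks($html, $baseUrl)
{
    $links = array();
    if ($html === '') {
        return $links;
    }
    $dom = new DOMDocument('1.0');
    @$dom->loadHTML($html); // suppress warnings from malformed markup
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            // Resolve relative URLs against the base, as the original does
            $parts = parse_url($baseUrl);
            $href = $parts['scheme'] . '://' . $parts['host'] . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    return $links;
}

function crawlIteratively($startUrl, $maxDepth)
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $seen = array();
    $queue = array(array($startUrl, $maxDepth)); // pairs of (url, remaining depth)

    while ($queue) {
        list($url, $depth) = array_shift($queue);
        // Same checks as isValid(), but failing them now just skips one loop
        // iteration instead of leaving a stack of unfinished recursive calls
        if ($depth === 0 || isset($seen[$url]) || strpos($url, $host) === false) {
            continue;
        }
        $seen[$url] = true;

        $html = fetchHtml($url);
        echo $url, "<br>\n";

        foreach (extractLinks($html, $url) as $href) {
            $queue[] = array($href, $depth - 1); // enqueue instead of recursing
        }
    }
}

crawlIteratively('http://www.example.com/', 9999);
?>

With a queue the traversal becomes breadth-first rather than depth-first, but nothing waits on the call stack: memory use is then dominated by $seen and the queue, which grow with the number of distinct URLs rather than with the depth of the crawl.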
