duanhui3759 2015-04-04 02:41
Viewed 91 times

The PHP crawler I am using has a memory leak. What is causing it?

I am using a PHP crawler that has a memory leak. It runs fine for the first ~3125 links, then it runs out of memory. I tried getting rid of the MySQL insert, but that did not change anything. Can someone help me diagnose this problem? Thank you so much.

<?php
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
ini_set('max_execution_time', 0);
// USAGE
$startURL = $your_url;
$depth = 9999;
$crawler = new crawler($startURL, $depth);
// Exclude paths with the following structure from being processed
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();
class crawler
{
protected $_url;
protected $_depth;
protected $_host;
protected $_seen = array();
protected $_filter = array();

public function __construct($url, $depth = 5)
{
    $this->_url = $url;
    $this->_depth = $depth;
    $parse = parse_url($url);
    $this->_host = $parse['host'];
}

protected function _processAnchors($content, $url, $depth)
{
    $dom = new DOMDocument('1.0');
    @$dom->loadHTML($content); // @ suppresses warnings from malformed markup
    $anchors = $dom->getElementsByTagName('a');

    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
        }
        // Crawl only links that belong to the start domain (enforced by isValid())
        $this->crawl_page($href, $depth - 1);
    }
}

protected function _getContent($url)
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($handle);
    curl_close($handle);
    // Wrapped in an array because crawl_page() reads it with list()
    return array($response);
}

protected function _printResult($url, $depth)
{
    ob_end_flush();
    $currentDepth = $this->_depth - $depth;
    $count = count($this->_seen);
    echo "$url <br>";
    // Re-includes the config and opens a new PDO connection on every crawled page
    include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
    $databaseconnect = new PDO("mysql:dbname=DB_NAME;host=$mysqlhost;charset=utf8", $mysqlusername, $mysqlpassword);
    $statement = $databaseconnect->prepare("INSERT INTO data(url,name) VALUES(:url,:name)");
    $statement->execute(array(':url' => $url, ':name' => $url));
    ob_start();
    flush();
}

protected function isValid($url, $depth)
{
    if (strpos($url, $this->_host) === false
        || $depth === 0
        || isset($this->_seen[$url])
    ) {
        return false;
    }
    foreach ($this->_filter as $excludePath) {
        if (strpos($url, $excludePath) !== false) {
            return false;
        }
    }
    return true;
}

public function crawl_page($url, $depth)
{
    if (!$this->isValid($url, $depth)) {
        return;
    }
    // add to the seen URL
    $this->_seen[$url] = true;
    // get the page content (no return code is actually returned by _getContent())
    list($content) = $this->_getContent($url);
    // print Result for current Page
    $this->_printResult($url, $depth);
    // process subPages
    $this->_processAnchors($content, $url, $depth);
}

public function addFilterPath($path)
{
    $this->_filter[] = $path;
}

public function run()
{
    $this->crawl_page($this->_url, $this->_depth);
}
}
?>

1 answer

douzhuan1467 2015-04-04 02:59

I'm not sure if this classifies as a memory leak exactly. You are essentially using recursion without a terminating case. Before the crawl_page() method finishes it calls _processAnchors(), which in turn may call crawl_page() again if it finds any links (very likely). Every recursive call eats up more memory because the originating crawl_page() call (and most thereafter) can't be removed from the call stack until all of its recursive calls terminate.
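
To make that concrete, here is a minimal sketch of the same crawl written iteratively: an explicit queue replaces the recursion, so each page's call frame is released as soon as that page has been processed. The names below (fetchHtml(), extractLinks(), crawlIteratively(), $queue) are illustrative stand-ins for the original _getContent()/_processAnchors() logic, not functions from the question.

<?php
// Hypothetical stand-in for the original _getContent(): fetch one page with cURL
function fetchHtml($url)
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($handle);
    curl_close($handle);
    return $response === false ? '' : $response;
}

// Hypothetical stand-in for _processAnchors(): return the resolved links
// instead of recursing into crawl_page() for each one
function extractLinks($html, $baseUrl)
{
    $links = array();
    if ($html === '') {
        return $links;
    }
    $dom = new DOMDocument('1.0');
    @$dom->loadHTML($html); // suppress warnings from malformed markup
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            // Resolve relative URLs against the base, as the original does
            $parts = parse_url($baseUrl);
            $href = $parts['scheme'] . '://' . $parts['host'] . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    return $links;
}

function crawlIteratively($startUrl, $maxDepth)
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $seen = array();
    $queue = array(array($startUrl, $maxDepth)); // pairs of (url, remaining depth)

    while ($queue) {
        list($url, $depth) = array_shift($queue);
        // Same checks as isValid(), but failing them now just skips one loop
        // iteration instead of leaving a stack of unfinished recursive calls
        if ($depth === 0 || isset($seen[$url]) || strpos($url, $host) === false) {
            continue;
        }
        $seen[$url] = true;

        $html = fetchHtml($url);
        echo $url, "<br>\n";

        foreach (extractLinks($html, $url) as $href) {
            $queue[] = array($href, $depth - 1); // enqueue instead of recursing
        }
    }
}

crawlIteratively('http://www.example.com/', 9999);
?>

With a queue the traversal becomes breadth-first rather than depth-first, but nothing waits on the call stack: memory use is then dominated by $seen and the queue, which grow with the number of distinct URLs rather than with the depth of the crawl.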
