dsfdsf23423 2011-12-31 13:02

Error with a web crawler in PHP

I am trying to create a simple web crawler in PHP that can crawl .edu domains, given the seed URLs of the parent sites.

I have used Simple HTML DOM to implement the crawler, while some of the core logic is my own.

I am posting the code below and will try to explain the problems.

private function initiateChildCrawler($parent_Url_Html) {

    global $CFG;
    static $foundLink;
    static $parentID;
    static $urlToCrawl_InstanceOfChildren;

    $forEachCount = 0;
    foreach ($parent_Url_Html->getHTML()->find('a') as $foundLink)
    {
        $forEachCount++;
        if ($forEachCount < 500)
        {
            // Make the link absolute relative to the page it was found on.
            $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);

            if ($this->validateEduDomain($foundLink->href))
            {
                //Implement else condition later on
                $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                if ($parentID != FALSE)
                {
                    if ($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE)
                    {
                        // Fetch the child page's HTML, store it, and mark the URL as crawled.
                        $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                        if ($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext) != FALSE)
                        {
                            $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);

                            /*if($recursiveCount<1)
                            {
                                $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                            }*/
                        }
                    }
                }
            }
        }
    }
}

As you can see, initiateChildCrawler is called by the initiateParentCrawler function, which passes the parent link to the child crawler. Example of a parent link: www.berkeley.edu, for which the crawler will find all the links on its main page and return their HTML content. This continues until the seed URLs are exhausted.

For example:

1. harvard.edu: finds all the links and returns their HTML content (by calling childCrawler), then moves to the next parent in parentCrawler.
2. berkeley.edu: finds all the links and returns their HTML content (by calling childCrawler).

The other functions are self-explanatory.

Now the problem: after childCrawler completes the foreach loop over the links, the function is unable to exit properly. If I run the script from the CLI, the CLI crashes; running the script in the browser causes it to terminate.

But if I set the limit on crawled child links to 10 or less (by altering the $forEachCount check), the crawler works fine.

Please help me in this regard.

Message from CLI:

Problem signature:
  Problem Event Name:       APPCRASH
  Application Name:         php-cgi.exe
  Application Version:      5.3.8.0
  Application Timestamp:    4e537939
  Fault Module Name:        php5ts.dll
  Fault Module Version:     5.3.8.0
  Fault Module Timestamp:   4e537a04
  Exception Code:           c0000005
  Exception Offset:         0000c793
  OS Version:               6.1.7601.2.1.0.256.48
  Locale ID:                1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

1 Answer

  • doulutian4843 2011-12-31 13:31

    Flat Loop Example:

    1. You initiate the loop with a stack that contains all URLs you'd like to process first.
    2. Inside the loop:
      1. You shift the first URL (you obtain it and it's removed) from the stack.
      2. If you find new URLs, you add them at the end of the stack (push).

    This will run until all URLs from the stack are processed, so you add a counter (as you already have for the foreach) to prevent it from running too long:

    $URLStack = (array) $parent_Url_Html->getHTML()->find('a');
    $URLProcessedCount = 0;
    while ($URLProcessedCount++ < 500) # this could run endlessly, so the counter saves us from processing too many URLs
    {
        $url = array_shift($URLStack);
        if (!$url) break; # exit if the stack is empty
    
        # process URL
    
        # for each new URL:
        $URLStack[] = $newURL;
    }
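
    Applied to the crawler from the question, the loop might look roughly like this (a sketch only; validateEduDomain(), returnParentDomain(), urlToCrawl and the loadSaveInstance methods are the asker's own helpers, reused unchanged):

    $URLStack = $parent_Url_Html->getHTML()->find('a');
    $URLProcessedCount = 0;
    while ($URLProcessedCount++ < 500)
    {
        $link = array_shift($URLStack);
        if (!$link) break; # stack exhausted

        # resolve to an absolute URL and keep only .edu domains
        $href = url_to_absolute($parent_Url_Html->getURL(), $link->href);
        if (!$this->validateEduDomain($href)) continue;

        $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($href));
        if ($parentID == FALSE) continue;
        if ($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($href) != FALSE) continue;

        # fetch, store and mark the page as crawled, exactly as in the original code
        $child = new urlToCrawl($href);
        if ($child->getSimpleDomSource($CFG->finalContext) != FALSE) { # assumes global $CFG as in the question
            $this->loadSaveInstance->url_db_html($child->getURL(), $child->getHTML());
            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $href, "crawled", $parentID);
        }
    }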
    

    You can make it even smarter by not adding URLs to the stack that already exist in it; for that to work, though, you need to insert only absolute URLs into the stack. I highly suggest you do this, because there is no need to process a page you have already fetched (e.g. nearly every page links back to the homepage). If you want to keep previous entries around for that duplicate check, just increment $URLProcessedCount inside the loop instead (a de-duplication sketch follows the snippet):

    while ($URLProcessedCount < 500) # the counter still caps how many URLs are processed
    {
        $url = $URLStack[$URLProcessedCount++];

        # process URL as before; entries remain in $URLStack, so duplicates stay detectable
    }
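
    A minimal sketch of that de-duplication, assuming the stack holds absolute URL strings (a lookup array keyed by URL keeps the membership test cheap as the stack grows; $newURL is a placeholder for a freshly discovered link):

    $seen = array();                          # url => true, for O(1) membership tests
    foreach ($URLStack as $u) $seen[$u] = true;

    # when a new absolute URL is discovered:
    if (!isset($seen[$newURL])) {
        $seen[$newURL] = true;
        $URLStack[] = $newURL;                # queue each URL exactly once
    }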

    Additionally, I suggest you use PHP's DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
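
    For illustration, a minimal link-extraction sketch with DOMDocument ($htmlSource stands in for the fetched page body; resolving to absolute URLs would still be done with url_to_absolute() as in the question):

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);         # real-world HTML is rarely well-formed
    $dom->loadHTML($htmlSource);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $URLStack[] = $href;              # resolve to absolute before queueing
        }
    }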

    This answer was selected as the best answer by the asker.
