douxu5233 2013-03-07 22:50

Write a PHP script that opens a website's pages and stores the page content in variables

I have been building a search engine, but now I need a web crawler in PHP that can crawl my website for its content.

I don't know if a web crawler / spider is the right word, but I was hoping and wondering if anyone could help me write a simple PHP script that opens all pages in a domain ending in .php or .html and takes the content in the pages and stores that in a variable as raw text. One variable per page.

If anyone knows of a good open-source script that does this, or can help me write one, please share. I would greatly appreciate any and all help.


1 answer

  • drkxgs9358 2013-03-07 23:08

    Check out http://sourceforge.net/projects/php-crawler/

    Or try this simple code that searches for the presence of the Google Analytics tracking code:

    // Disable time limit to keep the script running
    set_time_limit(0);
    // Domain to start crawling
    $domain = "http://webdevwonders.com";
    // Content to search for existence
    $content = "google-analytics.com/ga.js";
    // Tag in which you look for the content
    $content_tag = "script";
    // Name of the output file
    $output_file = "analytics_domains.txt";
    // Maximum urls to check
    $max_urls_to_check = 100;
    $rounds = 0;
    // Array to hold all domains to check
    $domain_stack = array();
    // Maximum size of domain stack
    $max_size_domain_stack = 1000;
    // Hash to hold all domains already checked
    $checked_domains = array();
    
    // Loop through the domains as long as domains are available in the stack
    // and the maximum number of urls to check is not reached
    while ($domain != "" && $rounds < $max_urls_to_check) {
        $doc = new DOMDocument();
    
        // Get the sourcecode of the domain
        @$doc->loadHTMLFile($domain);
        $found = false;
    
        // Loop through each found tag of the specified type in the dom
        // and search for the specified content
        foreach($doc->getElementsByTagName($content_tag) as $tag) {
            // strpos() returns 0 for a match at the start, so compare against false
            if (strpos($tag->nodeValue, $content) !== false) {
                $found = true;
                break;
            }
        }
    
        // Add the domain to the checked domains hash
        $checked_domains[$domain] = $found;
        // Loop through each "a"-tag in the dom
        // and add its href domain to the domain stack if it is not an internal link
        foreach($doc->getElementsByTagName('a') as $link) {
            $href = $link->getAttribute('href');
            // Only follow absolute http:// links that point away from the current domain
            if (strpos($href, 'http://') === 0 && strpos($href, $domain) === false) {
                $href_array = explode("/", $href);
                $link_domain = "http://".$href_array[2];
                // Keep the domain stack to the predefined max of domains
                // and only push domains to the stack that have not been checked yet
                if (count($domain_stack) < $max_size_domain_stack &&
                    !isset($checked_domains[$link_domain])) {
                    array_push($domain_stack, $link_domain);
                }
            }
        }
    
        // Remove all duplicate urls from stack
        $domain_stack = array_unique($domain_stack);
        // Take the next domain off the front of the stack (array_shift also
        // reindexes it); an empty stack yields "" and ends the loop
        $domain = count($domain_stack) > 0 ? array_shift($domain_stack) : "";
        $rounds++;
    }
    
    $found_domains = "";
    // Add all domains where the specified search string
    // has been found to the found domains string
    foreach ($checked_domains as $key => $value) {
        if ($value) {
            $found_domains .= $key."\n";
        }
    }
    
    // Write found domains string to specified output file
    file_put_contents($output_file, $found_domains);
    

    I found it here.
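
    The script above follows links to other domains and only records whether a string was found, whereas the question asks for the text of each .php or .html page on a single site. A minimal sketch of that variant might look like the following. All function names (page_text, internal_page_links, crawl_site) are hypothetical, example.com is a placeholder, and the code assumes PHP 7 or later with the DOM extension. It only understands absolute and root-relative links, and it joins text nodes with spaces, which can split words that span inline tags.

    ```php
    <?php
    // Hypothetical helpers for illustration; none of these names come from the script above.

    // Return the visible text of an HTML page as one whitespace-normalised string.
    function page_text(string $html): string {
        if (trim($html) === '') {
            return '';
        }
        $doc = new DOMDocument();
        // Suppress warnings from malformed markup, as the script above does
        @$doc->loadHTML($html);
        $xpath = new DOMXPath($doc);
        $parts = array();
        // Collect every text node that is not inside a script or style element
        foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
            $parts[] = $node->nodeValue;
        }
        return trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
    }

    // Return absolute URLs of links on the same site whose path ends in .php or .html.
    function internal_page_links(string $html, string $base): array {
        if (trim($html) === '') {
            return array();
        }
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $links = array();
        foreach ($doc->getElementsByTagName('a') as $a) {
            // Drop any #fragment so the same page is not queued twice
            $href = preg_replace('/#.*$/', '', $a->getAttribute('href'));
            // Turn root-relative links ("/about.html") into absolute ones
            if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
                $href = $base . $href;
            }
            // Skip anything that is not on the same site
            if (strpos($href, $base . '/') !== 0) {
                continue;
            }
            $path = (string) parse_url($href, PHP_URL_PATH);
            if (preg_match('/\.(php|html)$/i', $path)) {
                $links[$href] = true;
            }
        }
        return array_keys($links);
    }

    // Breadth-first crawl starting at $start. $fetch is any callable that maps
    // a URL to its HTML source, or to false when the page cannot be loaded.
    function crawl_site(string $base, string $start, callable $fetch, int $max_pages = 100): array {
        $pages = array();               // URL => raw text, one entry per page
        $queue = array($start);
        $seen  = array($start => true);
        while ($queue && count($pages) < $max_pages) {
            $url  = array_shift($queue);
            $html = $fetch($url);
            if ($html === false) {
                continue;
            }
            $pages[$url] = page_text($html);
            foreach (internal_page_links($html, $base) as $link) {
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $queue[] = $link;
                }
            }
        }
        return $pages;
    }

    // Demo with an in-memory "site" so the sketch runs without network access
    $site = array(
        'http://example.com/index.php' =>
            '<html><body><h1>Home</h1><a href="/about.html">About</a>'
            . '<a href="http://other.example/x.php">Ext</a></body></html>',
        'http://example.com/about.html' =>
            '<html><body><p>About   us</p><script>var x = 1;</script>'
            . '<a href="/index.php">Home</a></body></html>',
    );
    $pages = crawl_site('http://example.com', 'http://example.com/index.php',
        function ($url) use ($site) {
            return isset($site[$url]) ? $site[$url] : false;
        });
    ```

    To crawl a real site, the fetcher could be something like function ($url) { return @file_get_contents($url); }, which assumes allow_url_fopen is enabled. $pages then holds one text entry per crawled URL, which is close to the "one variable per page" the question asks for.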

    This answer was selected by the asker as the best answer.
