dsuoedtom207012191 2010-11-17 13:37
Accepted

How do I keep my DOMDocument / DOMXPath PHP script from hogging memory?

I made this script to crawl certain links on a forum and extract the username, post date, and post number.

It works great; the only problem is that it hogs memory, and after about half an hour it slows down significantly.

Does anyone have suggestions to speed it up? I've been running wget on my server to start the script.

Thanks, Nick

<?php
//this php script is going to download pages and tear them apart from ###

/*
Here's the process:

1. prepare url 
2. get new HTML document from the web
3. extract xpath data
4. input in mysql database
*/


$baseURL="http://www.###.com";

//end viewtopic.php?p=357850
for ($post = 325479; $post <= 357850; $post++) {

//connect to mysql
if (!mysql_connect('localhost','###','###')) echo mysql_error();
mysql_select_db('###');

//check to see if the post is already indexed
$result = mysql_query("SELECT postnumber FROM ### WHERE postnumber = '$post'");
if (mysql_num_rows($result) > 0) {
    //echo "Already in the database." . "<br>";
    mysql_close();
    continue;
}

$url=$baseURL."/viewtopic.php?p=".$post;
//echo $url."<br>";

//get new HTML document
$html = new DOMDocument(); 
$html->loadHTMLFile($url);

$xpath = new DOMXpath($html);

//select the page elements that you want
//I want the parent of the TD class = forumRow
$links = $xpath->query( "//td[@class='forumRow']/parent::tr" ); 

    foreach($links as $results){
        $newDom = new DOMDocument;
        $newDom->appendChild($newDom->importNode($results,true));

        $xpath = new DOMXpath ($newDom);

        //which parts of the selection do you want?
        $time_stamp = trim($xpath->query("//td[2]/table/tr/td/span")->item(0)->nodeValue);
        $user_name = trim($xpath->query("//a[@class='genmed']")->item(0)->nodeValue);
        $post_number = trim($xpath->query("//td/a/@name")->item(0)->nodeValue);

        $return[] = array(
            'time_stamp' => $time_stamp,
            'username' => $user_name,
            'post_number' => $post_number,
            );
    }

    foreach ($return as $output) {
        if (strlen($output['time_stamp']) > 0 && strlen($output['username']) > 0) 
          {
          //$timestamp = substr($output['time_stamp'],8,25);
          //echo $timestamp . "<br>";
          //$unixtimestamp = strtotime($timestamp);
          //echo $unixtimestamp;
          //echo $output['time_stamp']."<br>";
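          //match a date such as "Nov 17, 2010 13:37"; the month names must be alternatives in a group, not a character class
          preg_match("/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4} \d{1,2}:\d{2}/", $output['time_stamp'], $matches);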
          $unixtimestamp = strtotime($matches[0]);

          //YYYY-MM-DD HH:MM:SS
          $phpdate=date("Y-m-d H:i:s",$unixtimestamp);
          $username=$output['username'];
          $post_number=$output['post_number'];
          //echo $phpdate ." by ". $username . " #" . $post_number ;

          $result = mysql_query("SELECT postnumber FROM ### WHERE postnumber = '$post_number'");
          if (mysql_num_rows($result) == 0) {         
            if (mysql_query("INSERT INTO ### VALUES('','$url','$username','$phpdate','$post_number')")) echo "Y";
            else echo "N";
            mysql_close();
          }
          echo "<br>";
          }
    }
}
?>

1 answer

  • douxiluan6555 2010-11-17 13:50

    You might want to take a look at mysql_free_result. Also, the fact that you are maintaining a $return array throughout the whole script doesn't help. If you want to avoid memory issues, you should crawl a dozen records, insert them, reset $return, crawl a dozen more, insert, reset, and so on. Otherwise the $return array gets huge, and that is probably one of the causes (if not the cause) of your problem.
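
The batching idea above can be sketched roughly as follows. This is only an illustration under some assumptions: PHP 7 or later, a hypothetical helper named `crawlInBatches`, and two hypothetical callbacks (`$fetch` standing in for the DOMDocument/DOMXPath extraction of one post, `$store` standing in for the database inserts). None of these names come from the original script. The buffer plays the role of `$return` and is emptied after every flush, so it stops growing over the run.

```php
<?php
// Hypothetical helper (not from the original script): process post IDs in
// small batches and empty the buffer after each flush.
function crawlInBatches(array $postIds, callable $fetch, callable $store, int $batchSize = 12): int
{
    $buffer = array();   // plays the role of $return in the question
    $stored = 0;

    foreach ($postIds as $postId) {
        // $fetch is assumed to return a list of records (possibly empty) for one post.
        foreach ($fetch($postId) as $record) {
            $buffer[] = $record;
        }

        // Flush once the buffer has reached the batch size.
        if (count($buffer) >= $batchSize) {
            $store($buffer);
            $stored += count($buffer);
            $buffer = array();   // reset so the buffer does not keep growing
        }
    }

    // Flush whatever is left from the final partial batch.
    if (count($buffer) > 0) {
        $store($buffer);
        $stored += count($buffer);
    }

    return $stored;
}
```

In the original script, `$store` would loop over the batch and run the INSERT queries. Calling `mysql_free_result($result)` after each `mysql_num_rows($result)` check would additionally release each result set once it is no longer needed.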

    This answer was accepted by the asker as the best answer.
