How do I keep my DOMDocument / DOMXPath PHP script from hogging memory?

Made this script to crawl certain links on a forum and extract the username, post date, and post number.

It works great, the only problem is that it hogs memory and after about a half hour it slows down significantly.

Does anyone have suggestions to speed it up? I've been running a WGET on my server to start the script.

Thanks, Nick

   <?
//this php script is going to download pages and tear them apart from ###

/*
Here's the process:

1. prepare url 
2. get new HTML document from the web
3. extract xpath data
4. input in mysql database
*/


$baseURL="http://www.###.com";

//end viewtopic.php?p=357850
for ($post = 325479; $post <= 357850; $post++) {

//connect to mysql
if (!mysql_connect('localhost','###','###')) echo mysql_error();
mysql_select_db('###');

//check to see if the post is already indexed
$result = mysql_query("SELECT postnumber FROM ### WHERE postnumber = '$post'");
if (mysql_num_rows($result) > 0) {
    //echo "Already in the database." . "<br>";
    mysql_close();
    continue;
}

$url=$baseURL."/viewtopic.php?p=".$post;
//echo $url."<br>";

//get new HTML document
$html = new DOMDocument(); 
$html->loadHTMLFile($url);

$xpath = new DOMXpath($html);

//select the page elements that you want
//I want the parent of the TD class = forumRow
$links = $xpath->query( "//td[@class='forumRow']/parent::tr" ); 

    foreach($links as $results){
        $newDom = new DOMDocument;
        $newDom->appendChild($newDom->importNode($results,true));

        $xpath = new DOMXpath ($newDom);

        //which parts of the selection do you want?
        $time_stamp = trim($xpath->query("//td[2]/table/tr/td/span")->item(0)->nodeValue);
        $user_name = trim($xpath->query("//a[@class='genmed']")->item(0)->nodeValue);
        $post_number = trim($xpath->query("//td/a/@name")->item(0)->nodeValue);

        $return[] = array(
            'time_stamp' => $time_stamp,
            'username' => $user_name,
            'post_number' => $post_number,
            );
    }

    foreach ($return as $output) {
        if (strlen($output['time_stamp']) > 0 && strlen($output['username']) > 0) 
          {
          //$timestamp = substr($output['time_stamp'],8,25);
          //echo $timestamp . "<br>";
          //$unixtimestamp = strtotime($timestamp);
          //echo $unixtimestamp;
          //echo $output['time_stamp']."<br>";
          preg_match("/[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec]{3} \d{1,2}[,] \d{4} \d{1,2}[:]\d{2}/", $output['time_stamp'], $matches);
          $unixtimestamp = strtotime($matches[0]);

          //YYYY-MM-DD HH:MM:SS
          $phpdate=date("Y-m-d H:i:s",$unixtimestamp);
          $username=$output['username'];
          $post_number=$output['post_number'];
          //echo $phpdate ." by ". $username . " #" . $post_number ;

          $result = mysql_query("SELECT postnumber FROM ### WHERE postnumber = '$post_number'");
          if (mysql_num_rows($result) == 0) {         
            if (mysql_query("INSERT INTO ### VALUES('','$url','$username','$phpdate','$post_number')")) echo "Y";
            else echo "N";
            mysql_close();
          }
          echo "<br>";
          }
    }
}
?>
Tags: php
doumuyu0837: Seconding the mysql_free_result() suggestion. That is, if result sets are what's exhausting your memory.
nearly 10 years ago
doupingzhi9674: Use a profiler such as XDebug, or do it manually: de.php.net/manual/en/function.memory-get-usage.php
nearly 10 years ago
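
A minimal sketch of the manual approach, assuming the same per-post loop as in the question: print memory_get_usage() before and after each iteration and watch where the numbers climb.

    // Manual profiling sketch: log real allocated memory once per crawled post.
    for ($post = 325479; $post <= 357850; $post++) {
        $before = memory_get_usage(true);

        // ... existing crawl / parse / insert code for this post ...

        $after = memory_get_usage(true);
        echo "post $post: grew " . ($after - $before) . " bytes, now at $after bytes<br>";
    }
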
du6jws6975: I was told that opening a new connection and closing it after each post might free memory. Is that not the case?
nearly 10 years ago
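
Reusing one connection is generally cheaper than reconnecting for every post; a rough sketch of moving the connect outside the loop (credentials and table names stay redacted as in the question):

    // Connect once before the loop, close once after it.
    $link = mysql_connect('localhost', '###', '###');
    if (!$link) die(mysql_error());
    mysql_select_db('###', $link);

    for ($post = 325479; $post <= 357850; $post++) {
        // ... crawl and insert for this post, reusing $link ...
    }

    mysql_close($link);
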
ds122455: I'm on PHP 5 specifically: "PHP Version 5.3.2-1ubuntu4.2".
nearly 10 years ago
duanlu1908: How do I identify which parts of the code cause memory to grow?
nearly 10 years ago
dtu11716: If the problem is the DOM (the whole document has to be loaded and a tree built, so yes, that can use some memory if the documents are large; on the other hand, we're talking about web pages here), you should consider another XML parser, such as SAX, or a pull parser like XMLReader: php.net/manual/en/intro.xmlreader.php
nearly 10 years ago
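
For reference, a minimal XMLReader sketch; it streams the document instead of building a full tree, but it expects well-formed XML, so the forum's HTML would have to be tidied or exported to XML first (the file name and element name below are made up):

    // Pull parsing: only the current node is held in memory at any time.
    $reader = new XMLReader();
    $reader->open('posts.xml');                    // hypothetical XML export of posts
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'post') {
            echo $reader->readOuterXml() . "\n";   // handle one <post> element at a time
        }
    }
    $reader->close();
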
dongtiran7769: (1) Why open a new DB connection for every post? (2) /[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec]{3}... does not (only) match what you think it does.
nearly 10 years ago
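
On point (2): square brackets build a character class, so [Jan|Feb|...]{3} matches any three characters drawn from those letters and the pipe (for example "aaa" or "|||"), not the month names. Alternation needs a group; a corrected pattern, kept close to the original otherwise, could look like this:

    // A group gives real alternation over the month abbreviations.
    $pattern = '/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4} \d{1,2}:\d{2}/';
    if (preg_match($pattern, 'Posted: Mar 5, 2011 14:32', $matches)) {
        echo strtotime($matches[0]);               // Unix timestamp for "Mar 5, 2011 14:32"
    }
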
duanhu7615: Which version of PHP are you using?
nearly 10 years ago
dongsibao8977: Have you determined which parts of the code cause memory to grow?
nearly 10 years ago

1 Answer



You might want to take a look at mysql_free_result. Also, the fact that you are maintaining a $return array throughout the whole script doesn't help. If you want to avoid memory issues, you should crawl a dozen records, insert them, reset $return, crawl a dozen more, insert, reset... and so on. Otherwise, the $return array gets huge, and that's probably one of the causes (if not the cause) of your problem.
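
A rough sketch of that batching idea, folded into the loop from the question; insert_batch() is a hypothetical helper wrapping the existing INSERT, and the table name stays redacted as in the original:

    // Flush $return every dozen records instead of letting it grow for the whole run.
    $return = array();
    foreach ($links as $results) {
        // ... existing XPath extraction into $time_stamp, $user_name, $post_number ...
        $return[] = array(
            'time_stamp'  => $time_stamp,
            'username'    => $user_name,
            'post_number' => $post_number,
        );

        if (count($return) >= 12) {
            insert_batch($return);   // hypothetical helper running the INSERT queries
            $return = array();       // reset so the array never grows unbounded
        }
    }
    if (count($return) > 0) insert_batch($return);

    // And free each SELECT result once it has been checked:
    $result = mysql_query("SELECT postnumber FROM ### WHERE postnumber = '$post'");
    $exists = mysql_num_rows($result) > 0;
    mysql_free_result($result);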

drt41563: Thanks, that definitely helps.
nearly 10 years ago