dqz7636 2015-09-17 04:19
142 views

Unable to read a huge CSV file with PHP fgetcsv - understanding memory consumption

Good morning. I'm currently learning some hard lessons while trying to handle huge CSV files of up to 4 GB.

The goal is to find items in a CSV file (an Amazon datafeed) by a given browse node and also by some given item IDs (ASINs), to get a mix of existing items (already in my database) plus some additional new ones, since items disappear from the marketplace from time to time. I also filter on the item titles, because many items share the same title.

I have read lots of tips here and finally decided to use PHP's fgetcsv(). I assumed this function would not exhaust memory, since it reads the file line by line. But no matter what I try, I keep running out of memory. I cannot understand why my code uses so much memory.

I set the memory limit to 4096 MB and the time limit to 0. The server has 64 GB of RAM and two SSD hard disks.

Could someone please check my piece of code and explain how it is possible that I'm running out of memory, and more importantly, how the memory is being used?

private function performSearchByASINs()
{
    $found = 0;
    $needed = 0;
    $minimum = 84;
    if(is_array($this->searchASINs) && !empty($this->searchASINs))
    {
        $needed = count($this->searchASINs);
    }
    if($this->searchFeed == NULL || $this->searchFeed == '')
    {
        return false;
    }
    $csv = fopen($this->searchFeed, 'r');
    if($csv)
    {
        $l = 0;
        $title_array = array();
        while(($line = fgetcsv($csv, 0, ',', '"')) !== false)
        {
            $header = array();
            if(trim($line[6]) != '')
            {
                if($l == 0)
                {
                    $header = $line;
                }
                else
                {
                    $asin = $line[0];
                    $title = $this->prepTitleDesc($line[6]);
                    if(is_array($this->searchASINs) 
                    && !empty($this->searchASINs) 
                    && in_array($asin, $this->searchASINs)) //search for existing items to get them updated
                    {
                        $add = true;
                        if(in_array($title, $title_array))
                        {
                            $add = false; 
                        }
                        if($add === true)
                        {
                            $this->itemsByASIN[$asin] = new stdClass();
                            foreach($header as $k => $key)
                            {
                                if(isset($line[$k]))
                                {
                                    $this->itemsByASIN[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                    if(($line[20] == $this->bnid || $line[21] == $this->bnid) 
                    && count($this->itemsByKey) < $minimum 
                    && !isset($this->itemsByASIN[$asin])) // searching for new items
                    {
                        $add = true;
                        if(in_array($title, $title_array))
                        {
                           $add = false;
                        }
                        if($add === true)
                        {
                            $this->itemsByKey[$asin] = new stdClass();
                            foreach($header as $k => $key)
                            {
                                if(isset($line[$k]))
                                {
                                    $this->itemsByKey[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));                                
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                }
                $l++;
                if($l > 200000 || $found == $minimum)
                {
                    break;
                }
            }
        }
        fclose($csv);
    }
}
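To narrow this down, the loop above can be stripped to a minimal sketch (the in-memory stream and the `scanCsv` helper are illustrative stand-ins, not part of the original code). It reads line by line exactly like the method above, and the `$keepTitles` flag mimics the growing `$title_array`, so you can compare peak memory with and without per-row accumulation:

```php
<?php
// Minimal sketch: read a CSV stream line by line with fgetcsv and report
// peak memory. $keepTitles mimics $title_array in the code above; keeping
// one value per row makes memory grow with the file size.
function scanCsv($handle, $keepTitles = false)
{
    $titles = array();
    $rows = 0;
    while (($line = fgetcsv($handle, 0, ',', '"')) !== false) {
        if ($keepTitles) {
            $titles[] = $line[0]; // retained for the whole run
        }
        $rows++;
    }
    return array($rows, memory_get_peak_usage(true));
}

// Stand-in for the real 4 GB feed: a tiny in-memory stream.
$h = fopen('php://memory', 'r+');
fwrite($h, "a,1\nb,2\nc,3\n");
rewind($h);
list($rows, $peak) = scanCsv($h);
fclose($h);
echo $rows, "\n"; // 3
```

If peak memory stays flat with `$keepTitles = false` but grows with it enabled, the consumption is in the retained arrays and objects (`$title_array`, `$itemsByASIN`, `$itemsByKey` and the strings they hold), not in fgetcsv() itself.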

3 answers

  • dongshan8194 2015-09-17 04:39

It is very hard to manage data this large using in-memory arrays without running into timeout or memory problems. Instead, why not parse the datafeed into a database table and do the heavy lifting from there?
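A sketch of that approach, assuming SQLite via PDO (the `items` table, its columns, and the sample rows are illustrative; in practice you would point PDO at a file database and fopen() the real feed): stream the CSV into a table once, then query it instead of rescanning the file in PHP.

```php
<?php
// Sketch of the database approach, using an in-memory SQLite DB and a
// tiny in-memory CSV stream as stand-ins for the real file and feed.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE items (asin TEXT PRIMARY KEY, title TEXT)');

$csv = fopen('php://memory', 'r+');
fwrite($csv, "B000AAA,\"Example item one\"\nB000BBB,\"Example item two\"\n");
rewind($csv);

$insert = $pdo->prepare('INSERT OR REPLACE INTO items (asin, title) VALUES (?, ?)');
$pdo->beginTransaction(); // batching inserts in one transaction is much faster
while (($line = fgetcsv($csv, 0, ',', '"')) !== false) {
    $insert->execute(array($line[0], $line[1]));
}
$pdo->commit();
fclose($csv);

// The heavy lifting is now an indexed SQL lookup, not a file scan:
$stmt = $pdo->prepare('SELECT title FROM items WHERE asin = ?');
$stmt->execute(array('B000AAA'));
echo $stmt->fetchColumn(), "\n"; // Example item one
```

With the data in a table, the ASIN and browse-node filters become WHERE clauses, and PHP only ever holds the handful of matching rows instead of accumulating objects for 200,000 lines.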

