dongyinzheng6572 2019-02-05 11:05
Views: 73
Accepted

Fastest way to process a CSV: bash vs PHP vs C/C++ processing speed [closed]

I have a CSV file with 5M rows. One option is to import them into a MySQL database and then loop over the table with PHP:

$db_class = new MysqlDb;
$db_class->ConnectDB();
$query = "SELECT * FROM mails WHERE .....";
$result = mysqli_query(MysqlDb::$db, $query);
// Fetch and process one row at a time
while ($arr = mysqli_fetch_array($result))
{
    // db row here
}

So I loop over all the mails from the table and process them. If they contain some bad string, I delete them, etc.

This works, but importing 5M rows is very slow, and so is looping over them one by one and editing the rows (deleting those that contain a bad string).

I am thinking of a better solution that skips the PHP/MySQL combination altogether: process the .csv file line by line and check whether the current row contains a specific bad string. I can do that in pure PHP, like:

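$file = fopen('file.csv', 'r');  // fgetcsv() needs a file handle, not the array returned by file()
while (($data = fgetcsv($file)) !== FALSE) {
    // process line; $data[0] is the first field
}
fclose($file);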

This is the bash script I use to loop over all the lines of a file:

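# Copy bac.csv to clean.csv, dropping every line that contains "badstring"
while IFS= read -r line; do
    [[ $line == *badstring* ]] || printf '%s\n' "$line"
done < bac.csv > clean.csv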

While in Python I do:

with open("file.csv", "r") as ins:
    for line in ins:
        # process line here
        pass
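
A minimal sketch of the "process the file line by line" idea in Python, assuming the address sits in the first column and that a row is bad when that field contains a given substring. The function name `filter_csv`, its parameters, and the file names in the usage note are illustrative only and do not come from the original post.

```python
import csv


def filter_csv(src_path, dst_path, bad_substring):
    """Copy src_path to dst_path, skipping rows whose first field contains bad_substring.

    Returns (kept, dropped). Rows are streamed one at a time, so the whole
    file is never held in memory.
    """
    kept = dropped = 0
    # newline="" is what the csv module documentation recommends for file objects
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and bad_substring in row[0]:
                dropped += 1
                continue
            writer.writerow(row)
            kept += 1
    return kept, dropped
```

It could be called as `filter_csv("file.csv", "clean.csv", "baddomain")`. Writing the kept rows to a new file avoids editing the input in place.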

A bad line would look like:

name@baddomain.com
name@domain (without extension)

etc. I have several criteria for what counts as a bad line, which is why I didn't bother posting them all here.
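
The two sample lines above could be expressed as a predicate along these lines. This is only a sketch: the blocklist contents, the rule that a domain must contain a dot with text on both sides, and the choice to treat a line without "@" as bad are all assumptions inferred from the two examples. `BAD_DOMAINS` and `is_bad_address` are hypothetical names, and the asker's other criteria are not covered.

```python
# Hypothetical blocklist; the original post only names baddomain.com as an example.
BAD_DOMAINS = {"baddomain.com"}


def is_bad_address(address):
    """Return True if the address matches one of the two sample 'bad line' patterns."""
    address = address.strip().lower()
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return True  # assumption: no "@", or nothing on one side of it, is bad
    if domain in BAD_DOMAINS:
        return True  # blocklisted domain, e.g. name@baddomain.com
    # "name@domain" has no extension: require a dot with text on both sides
    head, dot, tail = domain.rpartition(".")
    return not (dot and head and tail)
```

Keeping the rules in one function means further criteria can be added without touching the loop that reads the file.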

However, for very big files I must try to find a better solution. What do you recommend? Should I learn how to do it in C/C++ or bash? I already know a little bash, so I could get something working sooner. Is C/C++ much faster than bash for this situation, or should I stick with bash?

Thank you

1 answer

  • duan6301 2019-02-05 11:11

    As for a PHP solution, you are looking for fgetcsv. The manual includes an example of iterating over a CSV file.

    Or, if you want to be fancy, you can go with the league/csv library.

    This answer was selected as the best answer by the asker.