I have a CSV with 5M rows. One option is to import them into a MySQL database and then loop over the table with PHP.
$db_class = new MysqlDb;
$db_class->ConnectDB();
$query="SELECT * FROM mails WHERE .....";
$result=mysqli_query(MysqlDb::$db, $query);
while($arr=mysqli_fetch_array($result))
{
//db row here
}
So I loop over all the mails from the table and process them. If they contain some bad string, I delete them, etc.
This works, but it is very slow to import 5M rows, and it is also very slow to loop over all of them one by one and edit the rows (delete them when they contain a bad string).
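For context, the per-row check inside that loop is roughly like this (just a sketch; the id and email column names and the bad string are placeholders for my real criteria):
if (strpos($arr['email'], 'baddomain.com') !== false) {
    // row matches a bad pattern, so remove it from the table
    mysqli_query(MysqlDb::$db, "DELETE FROM mails WHERE id = " . (int)$arr['id']);
}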
I am thinking of a better solution that skips MySQL entirely: process the .csv file line by line and check whether the current row contains a specific bad string. I can do that in pure PHP, like:
$file = fopen('file.csv', 'r');
while (($data = fgetcsv($file)) !== FALSE) {
    // process line
    $data[0];
}
fclose($file);
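The full filtering pass would look something like this (only a sketch; 'clean.csv' is a placeholder output file and the strpos() check stands in for my real criteria, assuming the address is in the first column):
$in  = fopen('file.csv', 'r');
$out = fopen('clean.csv', 'w');   // good rows get written here
while (($data = fgetcsv($in)) !== FALSE) {
    // skip rows whose first column matches a bad pattern
    if (strpos($data[0], 'baddomain.com') !== false) {
        continue;
    }
    fputcsv($out, $data);
}
fclose($in);
fclose($out);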
This is the bash script I use to loop over all lines of a file:
while IFS= read -r line; do
    # keep the line unless it contains the bad string
    case "$line" in *badstring*) ;; *) printf '%s\n' "$line" ;; esac
done < bac.csv > clean.csv
In Python I do:
with open("file.csv", "r") as ins:
    array = []
    for line in ins:
        pass  # process line here
A bad line would be something like:
name@baddomain.com
name@domain (without extension)
etc. I have a few criteria for what a bad line is; that's why I didn't bother posting them all here.
However, for very big files I must try to find a better solution. What do you recommend? Should I learn how to do it in C/C++ or bash? Bash I already know a little, so I could get something working faster. Is C/C++ much faster than bash for this situation, or should I stick with bash?
Thank you