I've done a lot of searching through blogs, Google, and Stack Overflow. I've yet to find a working solution for my problem.
My PHP application lets users download a CSV template (containing the headers) to fill in for importing data into the system. Everything works great unless they use special/accented characters (umlaut, acute, grave) in one of the rows being imported.
Users download the CSV and then open it in Excel (the default on most systems with Office installed). From what I've seen, when they add everything they want imported and click Save in Excel, the file isn't encoded as UTF-8. Once they upload the changed file and PHP iterates over the CSV, inserting data into the MySQL database, it fails with something like "Unable to update record 1366:Incorrect string value: '\x9Arn's ...' for column 'rawContents' at row 1".
I'm not looking for a solution like "don't use Excel", as that's not an option. I'm looking for a way to take the uploaded file and make sure its encoding is UTF-8 so it's read properly into the database. Currently I catch the exception, and if it contains "Incorrect string value" I show a friendly message telling the user there is invalid data, to check the encoding, and to try again. What I want is to process their CSV regardless, with any rows containing invalid data (if I can't read them in) skipped and stored as what I call "error rows": any row that contains an error (such as invalid user input for a column, caught by validation) is recorded so the user can see which row failed and why, and export another CSV containing just the rows with errors.
I hope that's not too confusing or unclear. I did find a way to detect a row that has non-UTF-8 characters in it using the following:
function utf8_clean($str, $ignore = true)
{
    return iconv('UTF-8', 'UTF-8//' . ($ignore ? 'IGNORE' : 'TRANSLIT'), $str);
}

function contains_non_utf8($str)
{
    return (serialize($str) != serialize(utf8_clean($str)));
}
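To sanity-check these helpers, here's a standalone snippet (the helpers are repeated so it runs on its own; the comparison is simplified to a plain string comparison, which is equivalent for string input):

```php
<?php
// Same helpers as above, repeated so this snippet runs standalone.
function utf8_clean($str, $ignore = true)
{
    return iconv('UTF-8', 'UTF-8//' . ($ignore ? 'IGNORE' : 'TRANSLIT'), $str);
}

function contains_non_utf8($str)
{
    // If stripping invalid sequences changes the string, it wasn't valid UTF-8.
    return $str !== utf8_clean($str);
}

var_dump(contains_non_utf8('plain ascii')); // false
var_dump(contains_non_utf8("J\x9Arn"));     // true: a lone 0x9A byte is not valid UTF-8
```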
If there's some way to fix the encoding so the data can be stored with the correct characters, that'd be great. The second option is the "error rows" approach I mentioned: if I can't get a row into the right encoding, I want to store it so I can export it in the "error rows" CSV for the user to fix. But I don't know how to store the "raw" contents of that row so it can be exported as an error row in a CSV.
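One possibility for keeping the raw, possibly mis-encoded bytes of a failed row around for export (a sketch, with helper names I made up) is to base64-encode the line before it goes into the database, so the column only ever sees ASCII, and decode it again when building the "error rows" CSV:

```php
<?php
// Hypothetical helpers: wrap the raw CSV line so it can be stored in a utf8
// column no matter what encoding it arrived in.
function encode_error_row($rawLine)
{
    return base64_encode($rawLine); // ASCII-only, so any charset column accepts it
}

function decode_error_row($stored)
{
    return base64_decode($stored);  // original bytes back, untouched
}

$raw = "J\x9Arn;field2;field3";             // mis-encoded bytes straight from the upload
$stored = encode_error_row($raw);           // safe to INSERT as-is
assert(decode_error_row($stored) === $raw); // round-trips exactly
```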
Please feel free to throw out ideas on what I can do about this. One option I've considered is supporting Excel file imports, since Excel seems to retain the UTF-8 encoding on save if it's set on the template file, but I'd really like to keep supporting CSV.
I'm attempting to use "macroman" to get the data into the database, which seems effective, but I'm running into some issues with this route as well.
Right now I've got a try/catch statement similar to:
try {
    $this->saveImportRow($array);
} catch (Exception $e) {
    if ($e->getCode() === 1366) {
        // Retry with the connection charset switched to MacRoman.
        $dbClass->execute("SET NAMES 'macroman'");
        $this->saveImportRow($array);
        $dbClass->execute("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'");
    }
}
What this does is attempt to save the CSV data to the database; if that fails with error code 1366, it tries again using "macroman" before saving. This seems to work and allows importing both a CSV file that was opened and saved in Excel with special characters (e.g. ö) that Excel didn't save with the proper encoding, and a CSV file that was encoded as UTF-8 and contains special characters.
The issue now is pulling the data out and using it (processing the import).
When the data is saved, it's put into an array for mapping (the key is the database column the value belongs to); this array is serialized and saved in the database as parsedData. The issue is unserializing that data. When a row was inserted as UTF-8 there is no problem unserializing it, but that's not the case for rows that were inserted using "macroman" because the CSV file's encoding wasn't proper.
If I do a SET NAMES 'macroman' prior to selecting, the rows inserted using "macroman" unserialize fine, but the UTF-8 rows don't, and vice versa. Very frustrating. Any ideas?
I know my real goal is just to let the user know the encoding wasn't proper, but I thought it was interesting that I could get the data into and out of the database and import properly using macroman; it's just not consistent when a properly encoded CSV is uploaded. Maybe the import itself needs to know whether it's "macroman" or not, since I can assume that if even one row of the CSV had to be inserted as macroman then the entire file is encoded wrong. Or I guess my goal is kind of met, since I can mark the row as containing special characters with invalid encoding and tell the user to fix their encoding. But I'm sure everyone would prefer a more hands-off approach for users.
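One way to sidestep the SET NAMES juggling entirely (a sketch of an alternative, not what my code above does) would be to convert each field to UTF-8 in PHP before serialize(), so parsedData is always stored in one encoding. The function name and the fallback charset here are assumptions; 'MACROMAN' could be substituted for the Mac Excel case:

```php
<?php
// Sketch: force every field to valid UTF-8 before serialize(), so SELECTs
// never need to guess the connection charset.
function normalize_row_utf8(array $row, $fallback = 'Windows-1252')
{
    foreach ($row as $key => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $row[$key] = iconv($fallback, 'UTF-8', $value);
        }
    }
    return $row;
}

$clean = normalize_row_utf8(['name' => "J\xF6rn"]); // 0xF6 is 'ö' in Windows-1252
// $clean['name'] is now "Jörn" and safe to serialize() and insert as utf8.
```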
Maybe the import process needs a complete rethink, but I'm not sure. More comments/solutions/ideas would be greatly appreciated.
After struggling with conversions and lots more research, I came across some reasoning for detecting whether a string is MacRoman or Windows-1252 encoded (Windows-1252 being the default encoding when opening a CSV file in Excel, making changes, and saving).
Here's the logic I came down to:

* If the string contains one of the following bytes, assume MacRoman: 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1
* If the string contains one of the following bytes, assume Windows-1252: 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6
So I use the contains_non_utf8 function to detect when a string contains character(s) that aren't UTF-8 encoded, then a function to detect whether it's MacRoman or Windows-1252. After that I can simply run iconv('MACROMAN', 'UTF-8', $str) or iconv('Windows-1252', 'UTF-8', $str) to get a valid UTF-8 string to go forward with.
Here are the two new functions I came up with for detecting those bytes:
function isMacRomanEncoded($str)
{
    $testBytes = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
    foreach ($testBytes as $testByte) {
        // strpos, not mb_strpos: we're searching for raw bytes, and mbstring
        // would try to interpret them as (invalid) UTF-8.
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}

function isWindows1252Encoded($str)
{
    $testBytes = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
    foreach ($testBytes as $testByte) {
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}
Another thought: if contains_non_utf8 is true, first try mb_detect_encoding, and only then fall back to the MacRoman/Windows-1252 detection.
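Putting those pieces together, a combined best-effort converter might look like this (a sketch; the function names are mine, and the hint bytes are the heuristic above, not a guarantee):

```php
<?php
// True if $str contains any of the given raw byte values.
function contains_any_byte($str, array $bytes)
{
    foreach ($bytes as $byte) {
        if (strpos($str, chr($byte)) !== false) { // strpos: raw byte search
            return true;
        }
    }
    return false;
}

// Best-effort conversion to UTF-8: valid UTF-8 passes through, otherwise
// guess MacRoman vs Windows-1252 from the hint bytes and convert with iconv.
function to_utf8($str)
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    $macHints = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
    $winHints = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
    if (contains_any_byte($str, $macHints) && !contains_any_byte($str, $winHints)) {
        return iconv('MACROMAN', 'UTF-8', $str);
    }
    return iconv('Windows-1252', 'UTF-8', $str); // default: the common Excel case
}
```

For example, to_utf8("J\xF6rn") takes the Windows-1252 branch (0xF6 is a Windows-1252 hint byte) and comes back as valid UTF-8.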