dongyan1936
2013-12-20 00:15

CSV file edited in Excel causes "Incorrect string value" error when importing into MySQL

Accepted

I've done a lot of searching through blogs, Google, and Stack Overflow. I've yet to find a working solution for my problem.

My PHP application allows users to download a CSV template (containing the headers) to be filled in for importing data into the system. Everything works great unless they use special/foreign characters (umlaut, acute, grave) in one of the rows being imported.

Users are downloading the CSV and then opening it in Excel (the default on most systems with Office installed). From what I've seen, when they add everything they want imported to the file and click save, Excel doesn't encode it properly. Once they upload the changed file and PHP iterates over the CSV inserting data into the MySQL database, it fails with something like "Unable to update record 1366:Incorrect string value: '\x9Arn's ...' for column 'rawContents' at row 1".

I'm not looking for a solution like "Don't use Excel", as that's not an option. I'm looking for a way to take the uploaded file and make sure the encoding is set to UTF-8 so it's read properly into the database. Currently I catch the exception, and if it contains the error "Incorrect string value" I output a friendly message telling the user there is invalid data and to check the encoding and try again. I want to be able to process their CSV regardless; any rows with invalid data (if I can't read them in) would be skipped and stored as what I call "error rows". For any row that contains an error (including invalid user input caught by column validation) the user can see which row failed and why, and export another CSV containing just the error rows.

I hope that's not too confusing or unclear. I found a way to detect a row that has non-utf8 characters in it using the following:

function utf8_clean($str, $ignore = true)
{
    // Re-encode UTF-8 to UTF-8, dropping (IGNORE) or approximating
    // (TRANSLIT) any byte sequences that are not valid UTF-8.
    return iconv('UTF-8', 'UTF-8//' . (($ignore) ? 'IGNORE' : 'TRANSLIT'), $str);
}

function contains_non_utf8($str)
{
    // If cleaning changed anything, the original contained invalid UTF-8.
    return (serialize($str) != serialize(utf8_clean($str)));
}

If there's some way to fix the encoding and get the correct character encoding to store it, that'd be great. The second option is the "error rows" I mentioned: if I can't get the data into the right encoding, I want to store it so I can export an "error rows" CSV for fixing those errors. But I don't know how I can store the "raw" contents of that row to allow exporting it as an error row in CSV.

Please feel free to throw out ideas on what I can do about this. One option I've considered is supporting Excel file importing, since Excel seems to retain UTF-8 encoding on save if it's set on the template file, but I'd really like to see a way to keep supporting CSV.


I'm attempting to use "macroman" to get the data into the database, which seems very effective, but I'm running into some issues with this route as well.

Right now I've got a try/catch statement similar to:

try {
    $this->saveImportRow($array);
} catch (Exception $e) {
    if ($e->getCode() === 1366) {
        $dbClass->execute('SET NAMES \'macroman\'');
        $this->saveImportRow($array);
        $dbClass->execute('SET NAMES \'utf8\' COLLATE \'utf8_unicode_ci\'');
    }
}

What this does is attempt to save the CSV data to the database; if that fails with error code 1366, it tries again using "macroman" prior to saving. This seems to work properly: it handles a CSV file that was opened and saved in Excel and contains special characters (e.g. ö) that Excel didn't save with the proper encoding, and it also handles a CSV file that was encoded as UTF-8 and contains a special character (e.g. ö).

The issue now is pulling the data out and using it (processing the import).

When the data is saved, it's put into an array for mapping (the key is the database column the value belongs to); this array is serialized and saved in the database as parsedData. The issue is unserializing that data. When a row was inserted using UTF-8 there is no problem, and unserializing works as usual, but rows inserted using "macroman" (because the CSV file's encoding wasn't proper) are another story.

If I do a "SET NAMES 'macroman'" prior to selecting the rows, the rows inserted using "macroman" can be unserialized, but the UTF-8 inserted rows cannot. Very frustrating. Any ideas?

I know my goal is really just to let the user know the encoding wasn't proper, but I thought it was interesting that I was able to get the rows into the database, back out, and imported properly using macroman; it's just not consistent when a properly encoded CSV is uploaded. Maybe the import itself needs to know whether it's "macroman" or not, as I can assume that if it has to insert one of the rows in the CSV file as macroman then the entire file is encoded wrong. Or I guess my goal is sort of met, since I can mark the row as containing a special character with invalid encoding and just tell them to fix their encoding. But I'm sure we'd all prefer a more hands-off approach for users.

Maybe the import process needs a complete rethink/refresh, but I'm not sure. More comments/solutions/ideas would be greatly appreciated.


After struggling with conversions and lots more research I came across some reasoning on detecting whether it's Mac Roman or Windows-1252 encoded (default encoding when opening a CSV file in Excel, making changes and saving).

Here's the logic I came down to:

  • If the string contains one of the following bytes then assume MacRoman: 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1
  • If the string contains one of the following bytes then assume Windows-1252: 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6

So I use the contains_non_utf8 function to detect when a string contains characters that aren't valid UTF-8, then a function to determine whether it's MacRoman or Windows-1252. After that I can simply run iconv('MACROMAN', 'UTF-8', $str) or iconv('Windows-1252', 'UTF-8', $str) to receive a valid UTF-8 string to go forward with.

Here are the two new functions I came up with for detecting the bytes:

function isMacRomanEncoded($str) {
    $testBytes = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
    foreach ($testBytes as $testByte) {
        // strpos (not mb_strpos) so the search is byte-oriented; the input
        // is not valid UTF-8 at this point.
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}

function isWindows1252Encoded($str)
{
    $testBytes = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
    foreach ($testBytes as $testByte) {
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}
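Putting those pieces together, here's a minimal, self-contained sketch of the whole conversion step (the name convertToUtf8 is my own; note that iconv encoding names vary by build: libiconv accepts 'MACROMAN', while glibc's iconv only knows the alias 'MACINTOSH', so the sketch tries both):

```php
// Sketch: return a valid UTF-8 string, converting from MacRoman or
// Windows-1252 when the byte heuristics say so; otherwise leave as-is.
function convertToUtf8($str)
{
    // Already valid UTF-8? Leave it untouched.
    if (@iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str) {
        return $str;
    }
    // Bytes that suggest MacRoman output.
    foreach (array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1) as $b) {
        if (strpos($str, chr($b)) !== false) {
            $out = @iconv('MACROMAN', 'UTF-8', $str); // libiconv name
            return ($out !== false) ? $out : iconv('MACINTOSH', 'UTF-8', $str); // glibc alias
        }
    }
    // Bytes that suggest Windows-1252 output.
    foreach (array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6) as $b) {
        if (strpos($str, chr($b)) !== false) {
            return iconv('Windows-1252', 'UTF-8', $str);
        }
    }
    return $str; // could not classify - leave for the error-row handling
}
```

Strings that fail both byte tests fall through unchanged, which matches the error-row handling described above.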

Another thought: if contains_non_utf8 is true, run mb_detect_encoding first and only then fall back to the MacRoman/Windows-1252 detection.
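A minimal sketch of that idea (the function name and candidate list are my own assumptions; note that mbstring has no MacRoman support, so MacRoman bytes get misreported as Windows-1252, which is why the byte heuristics above are still needed):

```php
// Let mbstring take the first pass; strict mode (third argument true)
// rejects invalid sequences instead of guessing loosely.
function detect_and_convert($str)
{
    $enc = mb_detect_encoding($str, array('UTF-8', 'Windows-1252'), true);
    if ($enc === false || $enc === 'UTF-8') {
        return $str; // valid UTF-8, or undetectable - fall back to byte tests
    }
    return iconv($enc, 'UTF-8', $str);
}
```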

2 answers

  • dongshixingga7900 7 years ago

    I'm going to go ahead and answer my own question with the solution I ended up with.

    As you'll read above in the question the last update I added is pretty much the end solution.

    When reading data from the CSV I use the contains_non_utf8 check on the string, and if it returns true I run the following logic:

      • If the string contains one of the following bytes then assume MacRoman: 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1
      • If the string contains one of the following bytes then assume Windows-1252: 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6

    If one of the above is true/assumed then I use iconv to convert the string to UTF-8. If not, I do nothing to the string and continue as usual.

    So I use the contains_non_utf8 function to detect when a string contains characters that aren't valid UTF-8, then a function to determine whether it's MacRoman or Windows-1252. After that I can simply run iconv('MACROMAN', 'UTF-8', $str) or iconv('Windows-1252', 'UTF-8', $str) to receive a valid UTF-8 string to go forward with.

    On inserting, I've wrapped the insert query in a try/catch statement; in the catch I look for error code 1366 and, if it matches, I update the row to exclude the data but mark it as an error record with an error message. Although this never lets me provide an export back to the user with the data I was unable to import, it does give them the line number so they can look back at the upload file they used and find the record that failed to import.

    So there you have it. This is how I achieved the ability for a user to download a template CSV, open it in Excel (Mac or Windows), add data that includes an umlaut (or another foreign character that's available in UTF-8), click save, pick the file in the HTML file input, submit, and have it import successfully/correctly. It's going into use within the next month, so if anything else comes up I'll be sure to update this ticket with those details.

    Here's the functions I'm using:

    function isMacRomanEncoded($str) {
        $testBytes = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
        foreach ($testBytes as $testByte) {
            // strpos (not mb_strpos) so the search is byte-oriented; the
            // input is not valid UTF-8 at this point.
            if (strpos($str, chr($testByte)) !== false) {
                return true;
            }
        }
        return false;
    }
    
    function isWindows1252Encoded($str)
    {
        $testBytes = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
        foreach ($testBytes as $testByte) {
            if (strpos($str, chr($testByte)) !== false) {
                return true;
            }
        }
        return false;
    }
    

    Here's an example of the catch statement mentioned:

    try {
        return $this->saveImportRow($array);
    } catch (Exception $e) {
        if ($e->getCode() === 1366) {
            $array['dataColumn'] = null;
            $array['status'] = '2'; // 2 = Error
            $array['msg'] = 'Row contained invalid characters';
            return $this->saveImportRow($array);
        }
        throw $e;
    }
    

    If you have any questions (or further input) let me know.

    Thanks!

  • doujiang2812 8 years ago

    In my experience, what is happening here is that Excel falls back to its default encoding because your template CSV file does not contain a BOM (Byte Order Mark).

    Because the CSV file is text, if you are creating the template in PHP (and you are sure that the contents of the file truly are UTF-8), you can ensure that the file opens correctly in Excel (for Windows at least) using something like the following:

    $filecontents = chr(239) . chr(187) . chr(191) . $filecontents; // 0xEF 0xBB 0xBF = UTF-8 BOM
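For instance, when generating the template server-side, the BOM can be written before the header row (the file name and header columns below are just placeholders):

```php
// Prepend the UTF-8 BOM (0xEF 0xBB 0xBF, i.e. chr(239).chr(187).chr(191))
// so Excel recognizes the file as UTF-8 when it opens the template.
$fh = fopen('template.csv', 'w');
fwrite($fh, "\xEF\xBB\xBF");                    // UTF-8 byte order mark
fputcsv($fh, array('name', 'email', 'notes'));  // placeholder header row
fclose($fh);
```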
    

    Now, presuming that you are on Windows, Notepad++ will also add a BOM to a text file for you, so you could also edit the template this way.

    Another useful thing to try (to test the potential encoding issue of the import, post-Excel) is to open the file in Notepad first, save it as UTF-8, and then import to see if this fixes the issue.

    Now, provided the user does not change the encoding when they save, Excel should default to UTF-8, which should then read into PHP fine.
