dongyan1936
2013-12-20 00:15

CSV file edited in Excel causes "Incorrect string value" error when importing into MySQL

Accepted

I've done a lot of searching through blogs, Google, and Stack Overflow. I've yet to find a working solution for my problem.

My PHP application allows users to download a CSV template (containing the headers) to be filled in for importing data into the system. Everything works great unless they use special/foreign characters (umlaut, acute, grave) in one of the rows being imported.

Users are downloading the CSV and then opening it in Excel (the default on most systems with Office installed). From what I've seen, when they add everything they want imported to the file and click save, Excel doesn't encode it properly. Once they upload the changed file and PHP iterates over the CSV inserting data into the MySQL database, it fails with something like "Unable to update record 1366:Incorrect string value: '\x9Arn's ...' for column 'rawContents' at row 1".

I'm not looking for a solution like "Don't use Excel", as that's not an option. I'm looking for a way to take the uploaded file and make sure the encoding is set to UTF-8 so it's read properly into the database. Currently I catch the exception, and if it contains the error "Incorrect string value" I output a friendly message telling the user there is invalid data and to check the encoding and try again. I want to be able to process their CSV regardless; any rows with invalid data (if I can't read them in) would be skipped and stored as what I call "error rows". For any row that contains an error (including invalid user input caught by column validation) the user can see which row failed and why, and export another CSV containing just the error rows.

I hope that's not too confusing or unclear. I found a way to detect a row that has non-utf8 characters in it using the following:

function utf8_clean($str, $ignore = true)
{
    // Re-encode UTF-8 to UTF-8, dropping (IGNORE) or approximating
    // (TRANSLIT) any byte sequences that are not valid UTF-8.
    return iconv('UTF-8', 'UTF-8//' . (($ignore) ? 'IGNORE' : 'TRANSLIT'), $str);
}

function contains_non_utf8($str)
{
    // If cleaning changed anything, the original contained invalid UTF-8.
    return (serialize($str) != serialize(utf8_clean($str)));
}

If there's some way to fix the encoding and get the correct character encoding to store it, that'd be great. The second option is the "error rows" I mentioned: if I can't get the data into the right encoding, I want to store it so I can export an "error rows" CSV for fixing those errors. But I don't know how I can store the "raw" contents of that row to allow exporting it as an error row in CSV.

Please feel free to throw out ideas on what I can do about this. One option I've considered is supporting Excel file importing, since Excel seems to retain UTF-8 encoding on save if it's set on the template file, but I'd really like to see a way to keep supporting CSV.


I'm attempting to use "macroman" to get the data into the database, which seems very effective, but I'm running into some issues with this route as well.

Right now I've got a try/catch statement similar to:

try {
    $this->saveImportRow($array);
} catch (Exception $e) {
    if ($e->getCode() === 1366) {
        $dbClass->execute('SET NAMES \'macroman\'');
        $this->saveImportRow($array);
        $dbClass->execute('SET NAMES \'utf8\' COLLATE \'utf8_unicode_ci\'');
    }
}

What this does is attempt to save the CSV data to the database; if that fails with error code 1366, it tries again using "macroman" prior to saving. This seems to work properly: it handles a CSV file that was opened and saved in Excel and contains special characters (e.g. ö) that Excel didn't save with the proper encoding, and it also handles a CSV file that was encoded as UTF-8 and contains a special character (e.g. ö).

The issue now is pulling the data out and using it (processing the import).

When the data is saved, it's put into an array for mapping (the key is the database column the value belongs to); this array is serialized and saved in the database as parsedData. The issue is unserializing that data. When a row was inserted using UTF-8 there is no problem, and unserializing works as usual, but rows inserted using "macroman" (because the CSV file's encoding wasn't proper) are another story.

If I do a "SET NAMES 'macroman'" prior to selecting the rows, the rows inserted using "macroman" can be unserialized, but the UTF-8 inserted rows cannot. Very frustrating. Any ideas?

I know my goal is really just to let the user know the encoding wasn't proper, but I thought it was interesting that I was able to get the rows into the database, back out, and imported properly using macroman; it's just not consistent when a properly encoded CSV is uploaded. Maybe the import itself needs to know whether it's "macroman" or not, as I can assume that if it has to insert one of the rows in the CSV file as macroman then the entire file is encoded wrong. Or I guess my goal is sort of met, since I can mark the row as containing a special character with invalid encoding and just tell them to fix their encoding. But I'm sure we'd all prefer a more hands-off approach for users.

Maybe the import process needs a complete rethink/refresh, but I'm not sure. More comments/solutions/ideas would be greatly appreciated.


After struggling with conversions and lots more research I came across some reasoning on detecting whether it's Mac Roman or Windows-1252 encoded (default encoding when opening a CSV file in Excel, making changes and saving).

Here's the logic I came down to:

  • If the string contains one of the following bytes then assume MacRoman: 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1
  • If the string contains one of the following bytes then assume Windows-1252: 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6

So I use the contains_non_utf8 function to detect when a string contains characters that aren't valid UTF-8, then a function to determine whether it's MacRoman or Windows-1252. After that I can simply run iconv('MACROMAN', 'UTF-8', $str) or iconv('Windows-1252', 'UTF-8', $str) to receive a valid UTF-8 string to go forward with.

Here are the two new functions I came up with for detecting the bytes:

function isMacRomanEncoded($str) {
    $testBytes = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
    foreach ($testBytes as $testByte) {
        // strpos (not mb_strpos) so the search is byte-oriented; the input
        // is not valid UTF-8 at this point.
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}

function isWindows1252Encoded($str)
{
    $testBytes = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
    foreach ($testBytes as $testByte) {
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}
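Putting those pieces together, here's a minimal, self-contained sketch of the whole conversion step (the name convertToUtf8 is my own; note that iconv encoding names vary by build: libiconv accepts 'MACROMAN', while glibc's iconv only knows the alias 'MACINTOSH', so the sketch tries both):

```php
// Sketch: return a valid UTF-8 string, converting from MacRoman or
// Windows-1252 when the byte heuristics say so; otherwise leave as-is.
function convertToUtf8($str)
{
    // Already valid UTF-8? Leave it untouched.
    if (@iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str) {
        return $str;
    }
    // Bytes that suggest MacRoman output.
    foreach (array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1) as $b) {
        if (strpos($str, chr($b)) !== false) {
            $out = @iconv('MACROMAN', 'UTF-8', $str); // libiconv name
            return ($out !== false) ? $out : iconv('MACINTOSH', 'UTF-8', $str); // glibc alias
        }
    }
    // Bytes that suggest Windows-1252 output.
    foreach (array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6) as $b) {
        if (strpos($str, chr($b)) !== false) {
            return iconv('Windows-1252', 'UTF-8', $str);
        }
    }
    return $str; // could not classify - leave for the error-row handling
}
```

Strings that fail both byte tests fall through unchanged, which matches the error-row handling described above.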

Another thought: if contains_non_utf8 is true, run mb_detect_encoding first and only then fall back to the MacRoman/Windows-1252 detection.
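A minimal sketch of that idea (the function name and candidate list are my own assumptions; note that mbstring has no MacRoman support, so MacRoman bytes get misreported as Windows-1252, which is why the byte heuristics above are still needed):

```php
// Let mbstring take the first pass; strict mode (third argument true)
// rejects invalid sequences instead of guessing loosely.
function detect_and_convert($str)
{
    $enc = mb_detect_encoding($str, array('UTF-8', 'Windows-1252'), true);
    if ($enc === false || $enc === 'UTF-8') {
        return $str; // valid UTF-8, or undetectable - fall back to byte tests
    }
    return iconv($enc, 'UTF-8', $str);
}
```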

2 answers

  • dongshixingga7900 7 years ago

    I'm going to go ahead and answer my own question with the solution I ended up with.

    As you'll read above in the question the last update I added is pretty much the end solution.

    When reading data from the CSV I use the contains_non_utf8 check on the string, and if it returns true I run the following logic:

      • If the string contains one of the following bytes then assume MacRoman: 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1
      • If the string contains one of the following bytes then assume Windows-1252: 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6

    If one of the above is true/assumed then I use iconv to convert the string to UTF-8. If not, I do nothing to the string and continue as usual.

    So I use the contains_non_utf8 function to detect when a string contains characters that aren't valid UTF-8, then a function to determine whether it's MacRoman or Windows-1252. After that I can simply run iconv('MACROMAN', 'UTF-8', $str) or iconv('Windows-1252', 'UTF-8', $str) to receive a valid UTF-8 string to go forward with.

    On inserting, I've wrapped the insert query in a try/catch statement; in the catch I look for error code 1366 and, if it matches, I update the row to exclude the data but mark it as an error record with an error message. Although this never lets me provide an export back to the user with the data I was unable to import, it does give them the line number so they can look back at the upload file they used and find the record that failed to import.

    So there you have it. This is how I achieved the ability for a user to download a template CSV, open it in Excel (Mac or Windows), add data that includes an umlaut (or another foreign character that's available in UTF-8), click save, pick the file in the HTML file input, submit, and have it import successfully/correctly. It's going into use within the next month, so if anything else comes up I'll be sure to update this ticket with those details.

    Here's the functions I'm using:

    function isMacRomanEncoded($str) {
        $testBytes = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
        foreach ($testBytes as $testByte) {
            // strpos (not mb_strpos) so the search is byte-oriented; the
            // input is not valid UTF-8 at this point.
            if (strpos($str, chr($testByte)) !== false) {
                return true;
            }
        }
        return false;
    }
    
    function isWindows1252Encoded($str)
    {
        $testBytes = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
        foreach ($testBytes as $testByte) {
            if (strpos($str, chr($testByte)) !== false) {
                return true;
            }
        }
        return false;
    }
    

    Here's an example of the catch statement mentioned:

    try {
        return $this->saveImportRow($array);
    } catch (Exception $e) {
        if ($e->getCode() === 1366) {
            $array['dataColumn'] = null;
            $array['status'] = '2'; // 2 = Error
            $array['msg'] = 'Row contained invalid characters';
            return $this->saveImportRow($array);
        }
        throw $e;
    }
    

    If you have any questions (or further input) let me know.

    Thanks!

  • doujiang2812 8 years ago

    In my experience, what is happening here is that Excel falls back to its default encoding because your template CSV file does not contain a BOM (Byte Order Mark).

    Because the CSV file is text, if you are creating the template in PHP (and you are sure that the contents of the file truly are UTF-8), you can ensure that the file opens correctly in Excel (for Windows at least) using something like the following:

    $filecontents = chr(239) . chr(187) . chr(191) . $filecontents; // 0xEF 0xBB 0xBF = UTF-8 BOM
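For instance, when generating the template server-side, the BOM can be written before the header row (the file name and header columns below are just placeholders):

```php
// Prepend the UTF-8 BOM (0xEF 0xBB 0xBF, i.e. chr(239).chr(187).chr(191))
// so Excel recognizes the file as UTF-8 when it opens the template.
$fh = fopen('template.csv', 'w');
fwrite($fh, "\xEF\xBB\xBF");                    // UTF-8 byte order mark
fputcsv($fh, array('name', 'email', 'notes'));  // placeholder header row
fclose($fh);
```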
    

    Now, presuming that you are on Windows, Notepad++ will also add a BOM to a text file for you, so you could also edit the template this way.

    Another useful thing to try (to test the potential encoding issue of the import, post-Excel) is to open the file in Notepad first, save it as UTF-8, and then import to see if this fixes the issue.

    Now, provided the user does not change the encoding when they save, Excel should default to UTF-8, which should then read into PHP fine.
