I've done a lot of searching through blogs, Google, and Stack Overflow. I've yet to find a working solution for my problem.
My PHP application lets users download a CSV template (containing the headers) to fill in for importing data into the system. Everything works great unless they use special/accented characters (umlaut, acute, grave) in one of the rows being imported.
Users download the CSV and then open it in Excel (the default on most systems with Office installed). From what I've seen, when they add everything they want imported and click Save in Excel, the file isn't encoded as UTF-8. Once they upload the changed file and PHP iterates over the CSV, inserting data into the MySQL database, it fails with something like "Unable to update record 1366:Incorrect string value: '\x9Arn's ...' for column 'rawContents' at row 1".
I'm not looking for a solution like "don't use Excel", as that's not an option. I'm looking for a way to take the uploaded file and make sure its encoding is UTF-8 so it's read properly into the database. Currently I catch the exception, and if it contains "Incorrect string value" I show a friendly message telling the user there is invalid data, to check the encoding, and to try again. What I want is to process their CSV regardless, with any rows containing invalid data (if I can't read them in) skipped and stored as what I call "error rows": any row that contains an error (such as invalid user input for a column, caught by validation) is recorded so the user can see which row failed and why, and export another CSV containing just the rows with errors.
I hope that's not too confusing or unclear. I did find a way to detect a row that has non-UTF-8 characters in it using the following:
function utf8_clean($str, $ignore = true)
{
    return iconv('UTF-8', 'UTF-8//' . ($ignore ? 'IGNORE' : 'TRANSLIT'), $str);
}

function contains_non_utf8($str)
{
    return (serialize($str) != serialize(utf8_clean($str)));
}
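To sanity-check these helpers, here's a standalone snippet (the helpers are repeated so it runs on its own; the comparison is simplified to a plain string comparison, which is equivalent for string input):

```php
<?php
// Same helpers as above, repeated so this snippet runs standalone.
function utf8_clean($str, $ignore = true)
{
    return iconv('UTF-8', 'UTF-8//' . ($ignore ? 'IGNORE' : 'TRANSLIT'), $str);
}

function contains_non_utf8($str)
{
    // If stripping invalid sequences changes the string, it wasn't valid UTF-8.
    return $str !== utf8_clean($str);
}

var_dump(contains_non_utf8('plain ascii')); // false
var_dump(contains_non_utf8("J\x9Arn"));     // true: a lone 0x9A byte is not valid UTF-8
```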
If there's some way to fix the encoding so the data can be stored with the correct characters, that'd be great. The second option is the "error rows" approach I mentioned: if I can't get a row into the right encoding, I want to store it so I can export it in the "error rows" CSV for the user to fix. But I don't know how to store the "raw" contents of that row so it can be exported as an error row in a CSV.
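One possibility for keeping the raw, possibly mis-encoded bytes of a failed row around for export (a sketch, with helper names I made up) is to base64-encode the line before it goes into the database, so the column only ever sees ASCII, and decode it again when building the "error rows" CSV:

```php
<?php
// Hypothetical helpers: wrap the raw CSV line so it can be stored in a utf8
// column no matter what encoding it arrived in.
function encode_error_row($rawLine)
{
    return base64_encode($rawLine); // ASCII-only, so any charset column accepts it
}

function decode_error_row($stored)
{
    return base64_decode($stored);  // original bytes back, untouched
}

$raw = "J\x9Arn;field2;field3";             // mis-encoded bytes straight from the upload
$stored = encode_error_row($raw);           // safe to INSERT as-is
assert(decode_error_row($stored) === $raw); // round-trips exactly
```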
Please feel free to throw out ideas on what I can do about this. One option I've considered is supporting Excel file imports, since Excel seems to retain the UTF-8 encoding on save if it's set on the template file, but I'd really like to keep supporting CSV.
I'm attempting to use "macroman" to get the data into the database, which seems effective, but I'm running into some issues with this route as well.
Right now I've got a try/catch statement similar to:
try {
    $this->saveImportRow($array);
} catch (Exception $e) {
    if ($e->getCode() === 1366) {
        // Retry with the connection charset switched to MacRoman.
        $dbClass->execute("SET NAMES 'macroman'");
        $this->saveImportRow($array);
        $dbClass->execute("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'");
    }
}
What this does is attempt to save the CSV data to the database; if that fails with error code 1366, it tries again using "macroman" before saving. This seems to work and allows importing both a CSV file that was opened and saved in Excel with special characters (e.g. ö) that Excel didn't save with the proper encoding, and a CSV file that was encoded as UTF-8 and contains special characters.
The issue now is pulling the data out and using it (processing the import).
When the data is saved, it's put into an array for mapping (the key is the database column the value belongs to); this array is serialized and saved in the database as parsedData. The issue is unserializing that data. When a row was inserted as UTF-8 there is no problem unserializing it, but that's not the case for rows that were inserted using "macroman" because the CSV file's encoding wasn't proper.
If I do a SET NAMES 'macroman' prior to selecting, the rows inserted using "macroman" unserialize fine, but the UTF-8 rows don't, and vice versa. Very frustrating. Any ideas?
I know my real goal is just to let the user know the encoding wasn't proper, but I thought it was interesting that I could get the data into and out of the database and import properly using macroman; it's just not consistent when a properly encoded CSV is uploaded. Maybe the import itself needs to know whether it's "macroman" or not, since I can assume that if even one row of the CSV had to be inserted as macroman then the entire file is encoded wrong. Or I guess my goal is kind of met, since I can mark the row as containing special characters with invalid encoding and tell the user to fix their encoding. But I'm sure everyone would prefer a more hands-off approach for users.
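One way to sidestep the SET NAMES juggling entirely (a sketch of an alternative, not what my code above does) would be to convert each field to UTF-8 in PHP before serialize(), so parsedData is always stored in one encoding. The function name and the fallback charset here are assumptions; 'MACROMAN' could be substituted for the Mac Excel case:

```php
<?php
// Sketch: force every field to valid UTF-8 before serialize(), so SELECTs
// never need to guess the connection charset.
function normalize_row_utf8(array $row, $fallback = 'Windows-1252')
{
    foreach ($row as $key => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $row[$key] = iconv($fallback, 'UTF-8', $value);
        }
    }
    return $row;
}

$clean = normalize_row_utf8(['name' => "J\xF6rn"]); // 0xF6 is 'ö' in Windows-1252
// $clean['name'] is now "Jörn" and safe to serialize() and insert as utf8.
```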
Maybe the import process needs a complete rethink, but I'm not sure. More comments/solutions/ideas would be greatly appreciated.
After struggling with conversions and lots more research, I came across some reasoning for detecting whether a string is MacRoman or Windows-1252 encoded (Windows-1252 being the default encoding when opening a CSV file in Excel, making changes, and saving).
Here's the logic I came down to:

* If the string contains one of the following bytes, assume MacRoman: 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1
* If the string contains one of the following bytes, assume Windows-1252: 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6
So I use the contains_non_utf8 function to detect when a string contains character(s) that aren't UTF-8 encoded, then a function to detect whether it's MacRoman or Windows-1252. After that I can simply run iconv('MACROMAN', 'UTF-8', $str) or iconv('Windows-1252', 'UTF-8', $str) to get a valid UTF-8 string to go forward with.
Here are the two new functions I came up with for detecting those bytes:
function isMacRomanEncoded($str)
{
    $testBytes = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
    foreach ($testBytes as $testByte) {
        // strpos, not mb_strpos: we're searching for raw bytes, and mbstring
        // would try to interpret them as (invalid) UTF-8.
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}

function isWindows1252Encoded($str)
{
    $testBytes = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
    foreach ($testBytes as $testByte) {
        if (strpos($str, chr($testByte)) !== false) {
            return true;
        }
    }
    return false;
}
Another thought: if contains_non_utf8 is true, first try mb_detect_encoding, and only then fall back to the MacRoman/Windows-1252 detection.
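Putting those pieces together, a combined best-effort converter might look like this (a sketch; the function names are mine, and the hint bytes are the heuristic above, not a guarantee):

```php
<?php
// True if $str contains any of the given raw byte values.
function contains_any_byte($str, array $bytes)
{
    foreach ($bytes as $byte) {
        if (strpos($str, chr($byte)) !== false) { // strpos: raw byte search
            return true;
        }
    }
    return false;
}

// Best-effort conversion to UTF-8: valid UTF-8 passes through, otherwise
// guess MacRoman vs Windows-1252 from the hint bytes and convert with iconv.
function to_utf8($str)
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    $macHints = array(0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1);
    $winHints = array(0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6);
    if (contains_any_byte($str, $macHints) && !contains_any_byte($str, $winHints)) {
        return iconv('MACROMAN', 'UTF-8', $str);
    }
    return iconv('Windows-1252', 'UTF-8', $str); // default: the common Excel case
}
```

For example, to_utf8("J\xF6rn") takes the Windows-1252 branch (0xF6 is a Windows-1252 hint byte) and comes back as valid UTF-8.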