Sanitizing malformed UTF-8 strings

My gRPC service failed to send a request due to malformed user-data. Turns out the HR user-data has a bad UTF-8 string and gRPC could not encode it. I narrowed the bad field down to this string:

"Gr\351gory Smith" // Gr�gory Smith  (this is coming from an LDAP source)

So I want a way to sanitize such inputs should they contain bad UTF-8 encodings.

Not seeing any obvious sanitization functions in the unicode/utf8 standard package, here's my first naïve attempt:

func naïveSanitizer(in string) (out string) {
    for _, rune := range in {
        out += string(rune)
    }
    return
}

Output:

Before: Valid UTF-8? false  Name: 'Gr�gory Smith' Byte-Count:  13
After:  Valid UTF-8? true   Name: 'Gr�gory Smith' Byte-Count:  15


Is there a better or more standard way to salvage as much valid data as possible from a bad UTF-8 string?

The reason I hesitate here is that when iterating over the string and hitting the bad (3rd) character, utf8.ValidRune(rune) returns true: https://play.golang.org/p/_FZzeTRLVls

So my follow-up question is: when iterating over a string one rune at a time, will the rune value always be valid, even though the underlying source string encoding was malformed?

EDIT:

Just to clarify, this data is coming from an LDAP source: 500K user records. Of those 500K records, only 15 (fifteen), i.e. ~0.003%, return false from utf8.ValidString(...).

As @kostix and @peterSO have pointed out, the values may be valid if converted from another encoding (e.g. Latin-1) to UTF-8. Applying this theory to these outlier samples:

https://play.golang.org/p/9BA7W7qQcV3

Name:     "Jean-Fran\u00e7ois Smith" : (good UTF-8) :            : Jean-François Smith
Name:                   "Gr\xe9gory" : (bad  UTF-8) : Latin-1-Fix: Grégory
Name:               "Fr\xe9d\xe9ric" : (bad  UTF-8) : Latin-1-Fix: Frédéric
Name:                 "Fern\xe1ndez" : (bad  UTF-8) : Latin-1-Fix: Fernández
Name:                     "Gra\xf1a" : (bad  UTF-8) : Latin-1-Fix: Graña
Name:                     "Mu\xf1oz" : (bad  UTF-8) : Latin-1-Fix: Muñoz
Name:                     "P\xe9rez" : (bad  UTF-8) : Latin-1-Fix: Pérez
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                  "Gro\xdfmann" : (bad  UTF-8) : Latin-1-Fix: Großmann
Name:                     "Ure\xf1a" : (bad  UTF-8) : Latin-1-Fix: Ureña
Name:                    "Iba\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Ibañez
Name:                     "Nu\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Nuñez
Name:                     "Ba\xd1on" : (bad  UTF-8) : Latin-1-Fix: BaÑon
Name:                  "Gonz\xe1lez" : (bad  UTF-8) : Latin-1-Fix: González
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                 "Guti\xe9rrez" : (bad  UTF-8) : Latin-1-Fix: Gutiérrez
Name:                      "D\xedaz" : (bad  UTF-8) : Latin-1-Fix: Díaz
Name:               "Encarnaci\xf3n" : (bad  UTF-8) : Latin-1-Fix: Encarnación

3 answers

duanqian6295 2019-09-19 19:34
You could improve your "sanitiser" by dropping invalid runes:

package main

import (
    "fmt"
    "strings"
)

func notSoNaïveSanitizer(s string) string {
    var b strings.Builder
    for _, c := range s {
        // range yields U+FFFD for each invalid byte; skip it.
        if c == '\uFFFD' {
            continue
        }
        b.WriteRune(c)
    }
    return b.String()
}

func main() {
    fmt.Println(notSoNaïveSanitizer("Gr\351gory Smith"))
}


The problem though is that \351 is the character é in Latin-1.

@PeterSO pointed out that é also happens to sit at the same position (U+00E9) in Unicode's BMP. That is correct, but Unicode is not an encoding, and your data is supposedly encoded. So I think you simply have an incorrect assumption about the encoding of your data: it is not UTF-8 but rather Latin-1 (or something compatible with it with regard to Latin accented letters).

So I'd verify you really are dealing with Latin-1 (or whatever) and if so, golang.org/x/text/encoding provides complete tooling for re-encoding from legacy encodings to UTF-8 (or whatever).

(On a side note, it may simply be that you are not explicitly asking your data source to provide you with UTF-8-encoded data.)
This answer was selected by the asker as the best answer.
