Sanitizing bad UTF-8 strings

My gRPC service failed to send a request due to malformed user-data. Turns out the HR user-data has a bad UTF-8 string and gRPC could not encode it. I narrowed the bad field down to this string:

"Gr\351gory Smith" // Gr�gory Smith  (this is coming from an LDAP source)

So I want a way to sanitize such inputs should they contain bad UTF-8 encodings.

Not seeing any obvious sanitization functions in the unicode/utf8 standard package, here's my first naïve attempt:

func naïveSanitizer(in string) (out string) {
    for _, rune := range in {
        out += string(rune)
    }
    return
}

Output:

Before: Valid UTF-8? false  Name: 'Gr�gory Smith' Byte-Count:  13
After:  Valid UTF-8? true   Name: 'Gr�gory Smith' Byte-Count:  15

Playground version

Is there a better or more standard way to salvage as much valid data as possible from a bad UTF-8 string?

The reason I hesitate is that when I iterate over the string and reach the bad (3rd) character, utf8.ValidRune(rune) returns true: https://play.golang.org/p/_FZzeTRLVls

So my follow-up question is: when iterating over a string one rune at a time, will the rune value always be valid, even though the underlying source string encoding was malformed?

EDIT:

Just to clarify, this data is coming from an LDAP source: 500K user records. Of those 500K records only 15 (fifteen), i.e. ~0.003%, return a utf8.ValidString(...) of false.

As @kostix and @peterSO have pointed out, the values may be valid if converted from another encoding (e.g. Latin-1) to UTF-8. Applying this theory to these outlier samples:

https://play.golang.org/p/9BA7W7qQcV3

Name:     "Jean-Fran\u00e7ois Smith" : (good UTF-8) :            : Jean-François Smith
Name:                   "Gr\xe9gory" : (bad  UTF-8) : Latin-1-Fix: Grégory
Name:               "Fr\xe9d\xe9ric" : (bad  UTF-8) : Latin-1-Fix: Frédéric
Name:                 "Fern\xe1ndez" : (bad  UTF-8) : Latin-1-Fix: Fernández
Name:                     "Gra\xf1a" : (bad  UTF-8) : Latin-1-Fix: Graña
Name:                     "Mu\xf1oz" : (bad  UTF-8) : Latin-1-Fix: Muñoz
Name:                     "P\xe9rez" : (bad  UTF-8) : Latin-1-Fix: Pérez
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                  "Gro\xdfmann" : (bad  UTF-8) : Latin-1-Fix: Großmann
Name:                     "Ure\xf1a" : (bad  UTF-8) : Latin-1-Fix: Ureña
Name:                    "Iba\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Ibañez
Name:                     "Nu\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Nuñez
Name:                     "Ba\xd1on" : (bad  UTF-8) : Latin-1-Fix: BaÑon
Name:                  "Gonz\xe1lez" : (bad  UTF-8) : Latin-1-Fix: González
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                 "Guti\xe9rrez" : (bad  UTF-8) : Latin-1-Fix: Gutiérrez
Name:                      "D\xedaz" : (bad  UTF-8) : Latin-1-Fix: Díaz
Name:               "Encarnaci\xf3n" : (bad  UTF-8) : Latin-1-Fix: Encarnación

duanqian6295 2019-09-19 19:34
You could improve your "sanitiser" by dropping invalid runes:

package main

import (
    "fmt"
    "strings"
)

// notSoNaïveSanitizer drops every U+FFFD rune that range produces,
// which includes the ones standing in for undecodable bytes.
func notSoNaïveSanitizer(s string) string {
    var b strings.Builder
    for _, c := range s {
        if c == '\uFFFD' {
            continue
        }
        b.WriteRune(c)
    }
    return b.String()
}

func main() {
    fmt.Println(notSoNaïveSanitizer("Gr\351gory Smith"))
}

Playground.
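
One caveat with comparing against U+FFFD is that a correctly encoded U+FFFD already present in the input gets dropped as well. A minimal sketch of a stricter variant follows, assuming only undecodable bytes should be removed; the name dropInvalidBytes is made up for illustration. It relies on utf8.DecodeRuneInString returning utf8.RuneError with a width of 1 for an invalid byte, whereas a genuine U+FFFD decodes with a width of 3. On Go 1.13 or later, strings.ToValidUTF8 with an empty replacement should give the same result.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// dropInvalidBytes removes only the bytes that fail to decode as UTF-8.
// A correctly encoded U+FFFD in the input is preserved.
// (Hypothetical helper name, chosen for this sketch.)
func dropInvalidBytes(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			i++ // undecodable byte: skip it
			continue
		}
		b.WriteRune(r)
		i += size
	}
	return b.String()
}

func main() {
	in := "Gr\351gory Smith"
	fmt.Printf("%q\n", dropInvalidBytes(in))
	// Standard-library equivalent (Go 1.13+) with an empty replacement.
	fmt.Printf("%q\n", strings.ToValidUTF8(in, ""))
}
```

Either way the accented letter is lost rather than recovered, which is why the encoding question below matters.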

The problem though is that \351 is the character é in Latin-1.

@PeterSO pointed out that it also happens to sit at the same position in Unicode's BMP. That is correct, but Unicode is not an encoding, and your data is supposedly encoded. So I think you simply have an incorrect assumption about the encoding of your data: it is not UTF-8 but rather Latin-1 (or something compatible with regard to accented Latin letters).

So I'd verify you really are dealing with Latin-1 (or whatever) and if so, golang.org/x/text/encoding provides complete tooling for re-encoding from legacy encodings to UTF-8 (or whatever).

(On a side note, it may simply be that you are not explicitly asking your data source to provide UTF-8-encoded data.)
The asker selected this answer as the best answer.
