Sanitizing a bad UTF-8 string

My gRPC service failed to send a request due to malformed user-data. Turns out the HR user-data has a bad UTF-8 string and gRPC could not encode it. I narrowed the bad field down to this string:

"Gr\351gory Smith" // Gr�gory Smith  (this is coming from an LDAP source)

So I want a way to sanitize such inputs should they contain bad UTF-8 encodings.

Not seeing any obvious sanitization functions in the unicode/utf8 standard package, here's my first naïve attempt:

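// naïveSanitizer re-encodes the input one rune at a time.
// Ranging over a string decodes each invalid byte as U+FFFD.
func naïveSanitizer(in string) (out string) {
    for _, r := range in { // r, so the built-in type name rune is not shadowed
        out += string(r)
    }
    return
}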

Output:

Before: Valid UTF-8? false  Name: 'Gr�gory Smith' Byte-Count:  13
After:  Valid UTF-8? true   Name: 'Gr�gory Smith' Byte-Count:  15

Playground version

Is there a better or more standard way to salvage as much valid data as possible from a bad UTF-8 string?

The reason I pause here is that, while iterating over the string, when the bad (3rd) character is encountered, utf8.ValidRune(rune) returns true: https://play.golang.org/p/_FZzeTRLVls

So my follow-up question is: when iterating over a string one rune at a time, will the rune value always be valid, even though the underlying source string encoding was malformed?
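For reference, the Go specification says that when a range loop over a string meets an invalid UTF-8 sequence, it yields 0xFFFD (utf8.RuneError) and advances by a single byte. The sketch below illustrates that behaviour; countDecodeErrors is a hypothetical helper name, not something from the standard library. It tells a decoding failure apart from a genuine U+FFFD by checking the decoded width.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countDecodeErrors (hypothetical helper) reports how many bytes of s
// failed to decode as UTF-8. utf8.DecodeRuneInString returns
// (utf8.RuneError, 1) for an invalid byte, whereas a genuine U+FFFD
// in the input decodes with a width of 3.
func countDecodeErrors(s string) int {
	n := 0
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			n++
		}
		i += size
	}
	return n
}

func main() {
	s := "Gr\351gory Smith" // \351 is the single byte 0xE9, invalid UTF-8 here

	// The range loop substitutes U+FFFD for the bad byte, and U+FFFD is
	// itself a valid rune, so utf8.ValidRune reports true for every value.
	for i, r := range s {
		fmt.Printf("%2d: %U valid=%t\n", i, r, utf8.ValidRune(r))
	}
	fmt.Println("decode errors:", countDecodeErrors(s))
}
```

So the rune values seen by the loop are always valid; it is the source bytes that were not, which is why utf8.ValidRune cannot be used to detect the problem.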

EDIT:

Just to clarify, this data is coming from an LDAP source: 500K user records. Of those 500K records only 15 (fifteen), i.e. ~0.003%, return a utf8.ValidString(...) of false.

As @kostix and @peterSO have pointed out, the values may be valid if converted from another encoding (e.g. Latin-1) to UTF-8. Applying this theory to these outlier samples:

https://play.golang.org/p/9BA7W7qQcV3

Name:     "Jean-Fran\u00e7ois Smith" : (good UTF-8) :            : Jean-François Smith
Name:                   "Gr\xe9gory" : (bad  UTF-8) : Latin-1-Fix: Grégory
Name:               "Fr\xe9d\xe9ric" : (bad  UTF-8) : Latin-1-Fix: Frédéric
Name:                 "Fern\xe1ndez" : (bad  UTF-8) : Latin-1-Fix: Fernández
Name:                     "Gra\xf1a" : (bad  UTF-8) : Latin-1-Fix: Graña
Name:                     "Mu\xf1oz" : (bad  UTF-8) : Latin-1-Fix: Muñoz
Name:                     "P\xe9rez" : (bad  UTF-8) : Latin-1-Fix: Pérez
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                  "Gro\xdfmann" : (bad  UTF-8) : Latin-1-Fix: Großmann
Name:                     "Ure\xf1a" : (bad  UTF-8) : Latin-1-Fix: Ureña
Name:                    "Iba\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Ibañez
Name:                     "Nu\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Nuñez
Name:                     "Ba\xd1on" : (bad  UTF-8) : Latin-1-Fix: BaÑon
Name:                  "Gonz\xe1lez" : (bad  UTF-8) : Latin-1-Fix: González
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                 "Guti\xe9rrez" : (bad  UTF-8) : Latin-1-Fix: Gutiérrez
Name:                      "D\xedaz" : (bad  UTF-8) : Latin-1-Fix: Díaz
Name:               "Encarnaci\xf3n" : (bad  UTF-8) : Latin-1-Fix: Encarnación

3 answers

duanqian6295 2019-09-19 11:34
You could improve your "sanitiser" by dropping invalid runes:

package main

import (
    "fmt"
    "strings"
)

func notSoNaïveSanitizer(s string) string {
    var b strings.Builder
    for _, c := range s {
        if c == '\uFFFD' {
            continue
        }
        b.WriteRune(c)
    }
    return b.String()
}

func main() {
    fmt.Println(notSoNaïveSanitizer("Gr\351gory Smith"))
}

Playground.
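One caveat with comparing against '\uFFFD' in a range loop is that it also drops a replacement character that was correctly encoded in the input. As an alternative sketch, assuming Go 1.13 or later, the standard library's strings.ToValidUTF8 replaces only runs of invalid bytes. The wrapper name dropInvalid is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// dropInvalid (hypothetical helper) removes every run of invalid UTF-8
// bytes from s. strings.ToValidUTF8 replaces each such run with the given
// replacement string; an empty replacement deletes the run.
func dropInvalid(s string) string {
	return strings.ToValidUTF8(s, "")
}

func main() {
	fmt.Println(dropInvalid("Gr\351gory Smith"))        // invalid byte removed
	fmt.Println(dropInvalid("Gr\uFFFDgory Smith"))      // a correctly encoded U+FFFD is kept
	fmt.Println(strings.ToValidUTF8("Gr\351gory", "?")) // or substitute a visible marker
}
```

Either way the accented letter is lost, so this only makes the string acceptable to gRPC; it does not recover the original name.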

The problem though is that \351 is the character é in Latin-1.

@PeterSO pointed out that é also happens to sit at the same position (U+00E9) in Unicode's BMP. That is correct, but Unicode is not an encoding, and your data is supposedly encoded. So I think you simply have an incorrect assumption about the encoding of your data: it is not UTF-8 but rather Latin-1 (or something compatible with it as far as accented Latin letters are concerned).

So I'd verify you really are dealing with Latin-1 (or whatever) and if so, golang.org/x/text/encoding provides complete tooling for re-encoding from legacy encodings to UTF-8 (or whatever).
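To illustrate the idea without an external dependency, here is a minimal standard-library sketch. It relies on the fact that every Latin-1 (ISO 8859-1) byte value equals the Unicode code point of the same character. The helper name latin1ToUTF8 is hypothetical, and the fallback ASSUMES that any string failing utf8.ValidString is entirely Latin-1, which should be verified against the real data first. With golang.org/x/text, the decoder from charmap.ISO8859_1.NewDecoder() in the encoding/charmap subpackage would serve the same purpose.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 is a hypothetical helper. It returns s unchanged when s is
// already valid UTF-8; otherwise it ASSUMES every byte is Latin-1 and maps
// each byte to the Unicode code point with the same numeric value.
func latin1ToUTF8(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	runes := make([]rune, len(s))
	for i := 0; i < len(s); i++ {
		runes[i] = rune(s[i]) // e.g. Latin-1 byte 0xE9 becomes U+00E9 (é)
	}
	return string(runes) // string(runes) encodes the code points as UTF-8
}

func main() {
	names := []string{"Jean-Fran\u00e7ois", "Gr\xe9gory", "Gro\xdfmann"}
	for _, name := range names {
		fmt.Printf("%q -> %s\n", name, latin1ToUTF8(name))
	}
}
```

A string that mixes valid UTF-8 sequences with stray Latin-1 bytes would be mangled by this all-or-nothing approach, so it only suits data where each value is consistently in one encoding or the other.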

(On a side note, it may simply be that you are not explicitly asking your data source to provide UTF-8-encoded data.)
This answer was selected by the asker as the best answer.
