douwei8295 2019-09-19 18:59
浏览 203
已采纳

清理错误的UTF-8字符串

My gRPC service failed to send a request due to malformed user-data. Turns out the HR user-data has a bad UTF-8 string and gRPC could not encode it. I narrowed the bad field down to this string:

"Gr\351gory Smith" // Gr�gory Smith  (this is coming from an LDAP source)

So I want a way to sanitized such inputs should they contain bad UTF-8 encodings.

Not seeing any obvious sanitization functions in the unicode/utf8 standard package, here's my first naïve attempt:

func naïveSanitizer(in string) (out string) {
    for _, rune := range in {
        out += string(rune)
    }
    return
}

Output:

Before: Valid UTF-8? false  Name: 'Gr�gory Smith' Byte-Count:  13
After:  Valid UTF-8? true   Name: 'Gr�gory Smith' Byte-Count:  15

Playground version

Is there a better or more standard way to salvage as much valid data from a bad UTF-8 string?


The reason I have pause here is because while iterating the string and the bad (3rd) character is encountered, utf8.ValidRune(rune) returns true: https://play.golang.org/p/_FZzeTRLVls

So my follow-up question is, will iterating a string - one rune at a time - will the rune value always be valid? Even though the underlying source string encoding was malformed?


EDIT:

Just to clarify, this data is coming from an LDAP source: 500K user records. Of those 500K records only 15 (fifteen) i.e. ~0.03% return a uf8.ValidString(...) of false.

As @kostix and @peterSO have pointed out, the values may be valid if converted from another encoding (e.g. Latin-1) to UTF-8. Applying this theory to these outlier samples:

https://play.golang.org/p/9BA7W7qQcV3

Name:     "Jean-Fran\u00e7ois Smith" : (good UTF-8) :            : Jean-François Smith
Name:                   "Gr\xe9gory" : (bad  UTF-8) : Latin-1-Fix: Grégory
Name:               "Fr\xe9d\xe9ric" : (bad  UTF-8) : Latin-1-Fix: Frédéric
Name:                 "Fern\xe1ndez" : (bad  UTF-8) : Latin-1-Fix: Fernández
Name:                     "Gra\xf1a" : (bad  UTF-8) : Latin-1-Fix: Graña
Name:                     "Mu\xf1oz" : (bad  UTF-8) : Latin-1-Fix: Muñoz
Name:                     "P\xe9rez" : (bad  UTF-8) : Latin-1-Fix: Pérez
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                  "Gro\xdfmann" : (bad  UTF-8) : Latin-1-Fix: Großmann
Name:                     "Ure\xf1a" : (bad  UTF-8) : Latin-1-Fix: Ureña
Name:                    "Iba\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Ibañez
Name:                     "Nu\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Nuñez
Name:                     "Ba\xd1on" : (bad  UTF-8) : Latin-1-Fix: BaÑon
Name:                  "Gonz\xe1lez" : (bad  UTF-8) : Latin-1-Fix: González
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                 "Guti\xe9rrez" : (bad  UTF-8) : Latin-1-Fix: Gutiérrez
Name:                      "D\xedaz" : (bad  UTF-8) : Latin-1-Fix: Díaz
Name:               "Encarnaci\xf3n" : (bad  UTF-8) : Latin-1-Fix: Encarnación
  • 写回答

3条回答

      报告相同问题?

      相关推荐 更多相似问题

      悬赏问题

      • ¥20 51单片机实训实验报告
      • ¥15 C# 循环读写数据中途突然变慢
      • ¥15 用Java实现双端队列
      • ¥150 ID3决策树实现分类
      • ¥15 multisim10安装后,找不到NI License Manager的程序来安装许可证
      • ¥15 C++银行卡系统 Help!
      • ¥15 R语言数据分析的相关问题
      • ¥15 模型导入SP后贴图纹理只有一个,拆了四张UV的,怎么解决?
      • ¥15 检索带order by 非常慢
      • ¥20 python 爬虫 token 加密方式