My gRPC service failed to send a request because of malformed user data. It turns out the HR user data contains a string with invalid UTF-8, and gRPC could not encode it. I narrowed the bad field down to this string:
"Gr\351gory Smith" // Gr�gory Smith (this is coming from an LDAP source)
So I want a way to sanitize such inputs should they contain invalid UTF-8 encodings.
Not seeing any obvious sanitization functions in the unicode/utf8 standard package, here's my first naïve attempt:
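func naïveSanitizer(in string) (out string) {
	// Ranging over a string decodes it one rune at a time.
	// The loop variable is named r so it does not shadow the built-in type rune.
	for _, r := range in {
		out += string(r)
	}
	return
}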
Output:
Before: Valid UTF-8? false Name: 'Gr�gory Smith' Byte-Count: 13
After: Valid UTF-8? true Name: 'Gr�gory Smith' Byte-Count: 15
Is there a better or more standard way to salvage as much valid data as possible from a string containing invalid UTF-8?
The reason I hesitate is that, while iterating over the string, when the bad (3rd) character is encountered, utf8.ValidRune returns true for it: https://play.golang.org/p/_FZzeTRLVls
So my follow-up question is: when iterating over a string one rune at a time, will the rune value always be valid, even though the underlying source string's encoding was malformed?
EDIT:
Just to clarify, this data is coming from an LDAP source: 500K user records. Of those 500K records, only 15 (fifteen), i.e. ~0.003%, return false from utf8.ValidString(...).
As @kostix and @peterSO have pointed out, the values may be valid if converted from another encoding (e.g. Latin-1) to UTF-8. Applying this theory to these outlier samples:
https://play.golang.org/p/9BA7W7qQcV3
Name: "Jean-Fran\u00e7ois Smith" : (good UTF-8) : : Jean-François Smith
Name: "Gr\xe9gory" : (bad UTF-8) : Latin-1-Fix: Grégory
Name: "Fr\xe9d\xe9ric" : (bad UTF-8) : Latin-1-Fix: Frédéric
Name: "Fern\xe1ndez" : (bad UTF-8) : Latin-1-Fix: Fernández
Name: "Gra\xf1a" : (bad UTF-8) : Latin-1-Fix: Graña
Name: "Mu\xf1oz" : (bad UTF-8) : Latin-1-Fix: Muñoz
Name: "P\xe9rez" : (bad UTF-8) : Latin-1-Fix: Pérez
Name: "Garc\xeda" : (bad UTF-8) : Latin-1-Fix: García
Name: "Gro\xdfmann" : (bad UTF-8) : Latin-1-Fix: Großmann
Name: "Ure\xf1a" : (bad UTF-8) : Latin-1-Fix: Ureña
Name: "Iba\xf1ez" : (bad UTF-8) : Latin-1-Fix: Ibañez
Name: "Nu\xf1ez" : (bad UTF-8) : Latin-1-Fix: Nuñez
Name: "Ba\xd1on" : (bad UTF-8) : Latin-1-Fix: BaÑon
Name: "Gonz\xe1lez" : (bad UTF-8) : Latin-1-Fix: González
Name: "Garc\xeda" : (bad UTF-8) : Latin-1-Fix: García
Name: "Guti\xe9rrez" : (bad UTF-8) : Latin-1-Fix: Gutiérrez
Name: "D\xedaz" : (bad UTF-8) : Latin-1-Fix: Díaz
Name: "Encarnaci\xf3n" : (bad UTF-8) : Latin-1-Fix: Encarnación