douwei8295 2019-09-19 18:59

Sanitizing a malformed UTF-8 string

My gRPC service failed to send a request due to malformed user-data. Turns out the HR user-data has a bad UTF-8 string and gRPC could not encode it. I narrowed the bad field down to this string:

"Gr\351gory Smith" // Gr�gory Smith  (this is coming from an LDAP source)

So I want a way to sanitize such inputs in case they contain bad UTF-8 encodings.

Not seeing any obvious sanitization functions in the unicode/utf8 standard package, here's my first naïve attempt:

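func naïveSanitizer(in string) (out string) {
    // Ranging over a string decodes it rune by rune; each undecodable
    // byte comes back as U+FFFD, which string(r) re-encodes as valid UTF-8.
    for _, r := range in { // r, not rune, to avoid shadowing the built-in type
        out += string(r)
    }
    return
}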

Output:

Before: Valid UTF-8? false  Name: 'Gr�gory Smith' Byte-Count:  13
After:  Valid UTF-8? true   Name: 'Gr�gory Smith' Byte-Count:  15

Playground version

Is there a better or more standard way to salvage as much valid data as possible from a bad UTF-8 string?


What gives me pause is that, while iterating over the string, utf8.ValidRune(rune) returns true even when the bad (3rd) character is encountered: https://play.golang.org/p/_FZzeTRLVls

So my follow-up question is: when iterating over a string one rune at a time, will the rune value always be valid, even though the underlying source string was malformed?


EDIT:

Just to clarify, this data is coming from an LDAP source: 500K user records. Of those 500K records only 15 (fifteen), i.e. ~0.003%, return a utf8.ValidString(...) of false.

As @kostix and @peterSO have pointed out, the values may be valid if converted from another encoding (e.g. Latin-1) to UTF-8. Applying this theory to these outlier samples:

https://play.golang.org/p/9BA7W7qQcV3

Name:     "Jean-Fran\u00e7ois Smith" : (good UTF-8) :            : Jean-François Smith
Name:                   "Gr\xe9gory" : (bad  UTF-8) : Latin-1-Fix: Grégory
Name:               "Fr\xe9d\xe9ric" : (bad  UTF-8) : Latin-1-Fix: Frédéric
Name:                 "Fern\xe1ndez" : (bad  UTF-8) : Latin-1-Fix: Fernández
Name:                     "Gra\xf1a" : (bad  UTF-8) : Latin-1-Fix: Graña
Name:                     "Mu\xf1oz" : (bad  UTF-8) : Latin-1-Fix: Muñoz
Name:                     "P\xe9rez" : (bad  UTF-8) : Latin-1-Fix: Pérez
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                  "Gro\xdfmann" : (bad  UTF-8) : Latin-1-Fix: Großmann
Name:                     "Ure\xf1a" : (bad  UTF-8) : Latin-1-Fix: Ureña
Name:                    "Iba\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Ibañez
Name:                     "Nu\xf1ez" : (bad  UTF-8) : Latin-1-Fix: Nuñez
Name:                     "Ba\xd1on" : (bad  UTF-8) : Latin-1-Fix: BaÑon
Name:                  "Gonz\xe1lez" : (bad  UTF-8) : Latin-1-Fix: González
Name:                    "Garc\xeda" : (bad  UTF-8) : Latin-1-Fix: García
Name:                 "Guti\xe9rrez" : (bad  UTF-8) : Latin-1-Fix: Gutiérrez
Name:                      "D\xedaz" : (bad  UTF-8) : Latin-1-Fix: Díaz
Name:               "Encarnaci\xf3n" : (bad  UTF-8) : Latin-1-Fix: Encarnación

3 answers

  • duanqian6295 2019-09-19 19:34

    You could improve your "sanitiser" by dropping invalid runes:

    package main
    
    import (
        "fmt"
        "strings"
    )
    
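    func notSoNaïveSanitizer(s string) string {
        var b strings.Builder
        for _, c := range s {
            // range reports every undecodable byte as U+FFFD. Note that this
            // check also drops a U+FFFD that was correctly encoded in s.
            if c == '\uFFFD' {
                continue
            }
            b.WriteRune(c)
        }
        return b.String()
    }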
    
    func main() {
        fmt.Println(notSoNaïveSanitizer("Gr\351gory Smith"))
    }
    

    Playground.
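
    If a correctly encoded U+FFFD in the input should survive, the two cases can be told apart by the decoded width: utf8.DecodeRuneInString returns (utf8.RuneError, 1) for an undecodable byte, while a real U+FFFD is three bytes wide. A minimal sketch of that variant, with dropInvalidBytes as a hypothetical name:

    ```go
    package main

    import (
    	"fmt"
    	"strings"
    	"unicode/utf8"
    )

    // dropInvalidBytes removes only the bytes that fail to decode as UTF-8.
    func dropInvalidBytes(s string) string {
    	var b strings.Builder
    	for i := 0; i < len(s); {
    		r, size := utf8.DecodeRuneInString(s[i:])
    		if r == utf8.RuneError && size == 1 {
    			i++ // undecodable byte: skip it
    			continue
    		}
    		b.WriteRune(r)
    		i += size
    	}
    	return b.String()
    }

    func main() {
    	fmt.Println(dropInvalidBytes("Gr\351gory Smith")) // the 0xE9 byte is dropped
    	fmt.Printf("%q\n", dropInvalidBytes("a\uFFFDb"))   // the encoded U+FFFD is kept
    }
    ```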

    The problem though is that \351 is the character é in Latin-1.

    @PeterSO pointed out that it also happens to sit at the same position in Unicode's BMP, and that is correct. But Unicode is not an encoding, and your data is supposedly encoded. So I think you simply have an incorrect assumption about the encoding of your data: it's not UTF-8 but rather Latin-1 (or something compatible with it as far as Latin accented letters are concerned).

    So I'd verify you really are dealing with Latin-1 (or whatever) and if so, golang.org/x/text/encoding provides complete tooling for re-encoding from legacy encodings to UTF-8 (or whatever).

    (On a side note, it may simply be that you never explicitly asked your data source to provide UTF-8-encoded data.)

    Selected by the asker as the best answer.
  • drn61317 2019-09-19 19:22

    Fix the problem at its source. \351 is the octal value of the Unicode code point for é (U+00E9).

    package main
    
    import "fmt"
    
    func main() {
        fmt.Println(string(rune(0351)))
        fullname := "Grégory Smith" // "Gr\351gory Smith"
        fmt.Println(fullname)
    }
    

    Playground: https://play.golang.org/p/WigFZk3iSK1

    Output:

    é
    Grégory Smith
    
  • dongzou3751 2019-09-19 19:49

    Go 1.13 introduces strings.ToValidUTF8(), so sanitizer() should simply be:

    func sanitize(s string) string {
        return strings.ToValidUTF8(s, "")
    }
    

    Which I don't even think deserves its own function. Try it on the Go Playground.

    If your input happens to be a byte slice, you may use the similar bytes.ToValidUTF8() function.

    Also note that if you don't want to discard data from your input without a trace, you may pass any replacement character (or multiple characters) when calling strings.ToValidUTF8(), for example:

    return strings.ToValidUTF8(in, "❗")
    

    Try this one on the Go Playground.
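
    One detail worth knowing is that the replacement is applied once per run of invalid bytes, not once per byte. A short sketch of both variants follows; sanitizeWith is a hypothetical wrapper added only so the calls can be compared side by side, and Go 1.13 or later is assumed:

    ```go
    package main

    import (
    	"bytes"
    	"fmt"
    	"strings"
    )

    // sanitizeWith is a thin, hypothetical wrapper around strings.ToValidUTF8.
    func sanitizeWith(s, replacement string) string {
    	return strings.ToValidUTF8(s, replacement)
    }

    func main() {
    	// Two adjacent invalid bytes form one run, so a single "?" is inserted.
    	fmt.Println(sanitizeWith("a\xff\xfeb", "?"))
    	// An empty replacement drops the invalid bytes entirely.
    	fmt.Println(sanitizeWith("Gr\351gory Smith", ""))
    	// The bytes variant takes and returns []byte; nil means "no replacement".
    	fmt.Printf("%s\n", bytes.ToValidUTF8([]byte("Gr\351gory"), nil))
    }
    ```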
