dongyang7152 2013-12-05 13:56
Accepted

Removing invalid UTF-8 characters from a string

I get this on json.Marshal of a list of strings:

json: invalid UTF-8 in string: "...ole\xc5\"

The reason is obvious, but how can I delete or replace the invalid characters in such strings in Go? I've been reading the docs for the unicode and unicode/utf8 packages, and there seems to be no obvious or quick way to do it.

In Python, for example, there are methods for this: invalid characters can be deleted, replaced by a specified character, or handled with a strict setting that raises an exception. How can I do the equivalent in Go?

UPDATE: By "the reason" I meant the reason for getting an exception (panic?): an illegal character in what json.Marshal expects to be a valid UTF-8 string.

(How the illegal byte sequence got into that string is not important; the usual ways are bugs, file corruption, other programs that do not conform to Unicode, etc.)
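
For context on the error above: according to the encoding/json documentation, Marshal returned an InvalidUTF8Error (an ordinary error value, not a panic) before Go 1.2, and from Go 1.2 onward it coerces strings to valid UTF-8 by replacing invalid bytes with U+FFFD. The sketch below assumes a Go 1.2+ toolchain and round-trips a list of strings to show that coercion; `roundTrip` is a hypothetical helper, and the input is a shortened stand-in for the string in the error message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip (hypothetical helper) marshals a list of strings to JSON
// and decodes the result again, so the effect of Marshal on invalid
// bytes can be inspected as ordinary Go strings.
func roundTrip(in []string) ([]string, error) {
	b, err := json.Marshal(in)
	if err != nil {
		// On Go releases before 1.2 an invalid string ends up here.
		return nil, err
	}
	var out []string
	if err := json.Unmarshal(b, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// "ole\xc5" ends with a lone lead byte, which is invalid UTF-8.
	out, err := roundTrip([]string{"ole\xc5"})
	fmt.Printf("%+q %v\n", out, err)
}
```

If silent replacement is not acceptable, the strings still have to be checked or cleaned before they reach json.Marshal.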

2 answers

  • dougou6213 2013-12-05 14:56

    For example,

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "a\xc5z"
        fmt.Printf("%q\n", s)
        if !utf8.ValidString(s) {
            v := make([]rune, 0, len(s))
            for i, r := range s {
                if r == utf8.RuneError {
                    // Tell an invalid byte (size 1) apart from a
                    // correctly encoded U+FFFD (size 3).
                    _, size := utf8.DecodeRuneInString(s[i:])
                    if size == 1 {
                        continue // drop the invalid byte
                    }
                }
                v = append(v, r)
            }
            s = string(v)
        }
        fmt.Printf("%q\n", s)
    }
    

    Output:

    "a\xc5z"
    "az"
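
On Go 1.13 or later, the standard library's strings.ToValidUTF8 offers a shorter route. The sketch below is one possible mapping of the three Python-style behaviours from the question (delete, replace, strict) onto it; the function names and the `ErrInvalidUTF8` sentinel are hypothetical, chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// ErrInvalidUTF8 is a hypothetical sentinel error for the strict mode.
var ErrInvalidUTF8 = errors.New("invalid UTF-8")

// dropInvalid removes every run of invalid bytes from s.
func dropInvalid(s string) string {
	return strings.ToValidUTF8(s, "")
}

// replaceInvalid substitutes U+FFFD for each run of invalid bytes.
// Note: a run of several adjacent invalid bytes yields one replacement.
func replaceInvalid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

// strict returns s unchanged, or ErrInvalidUTF8 if s is not valid UTF-8.
func strict(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", ErrInvalidUTF8
	}
	return s, nil
}

func main() {
	s := "a\xc5z"
	fmt.Printf("%+q\n", dropInvalid(s))    // "az"
	fmt.Printf("%+q\n", replaceInvalid(s)) // "a\ufffdz"
	_, err := strict(s)
	fmt.Println(err) // invalid UTF-8
}
```

Unlike the rune loop above, ToValidUTF8 collapses adjacent invalid bytes into a single replacement, so the two approaches can differ when the replacement string is non-empty.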
    

    Unicode Standard

    FAQ - UTF-8, UTF-16, UTF-32 & BOM

    Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

    A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.

    A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
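
The recovery behaviour described in the FAQ can be observed with utf8.DecodeRuneInString. This is a minimal sketch: the `step` type and `scan` function are hypothetical names, and the input "\xc5z" is an instance of the illegal <110xxxxx 0xxxxxxx> sequence.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// step records how one position of the input decoded.
type step struct {
	offset int  // byte offset where decoding started
	r      rune // decoded rune; utf8.RuneError for an ill-formed byte
	size   int  // number of bytes consumed
	valid  bool // false when the byte at offset was ill-formed
}

// scan walks s one decoding step at a time and reports each step.
func scan(s string) []step {
	var steps []step
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// RuneError with size 1 marks an invalid byte; a correctly
		// encoded U+FFFD would have size 3.
		valid := !(r == utf8.RuneError && size == 1)
		steps = append(steps, step{offset: i, r: r, size: size, valid: valid})
		i += size
	}
	return steps
}

func main() {
	// 0xC5 is 11000101 in binary (110xxxxx); 'z' is 01111010 (0xxxxxxx).
	for _, st := range scan("\xc5z") {
		fmt.Printf("offset=%d rune=%U size=%d valid=%t\n",
			st.offset, st.r, st.size, st.valid)
	}
	// Output:
	// offset=0 rune=U+FFFD size=1 valid=false
	// offset=1 rune=U+007A size=1 valid=true
}
```

Only the lead byte is reported as ill-formed, and decoding resumes at the second byte, which matches the "continue processing at the second byte" rule quoted above.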

    Selected by the asker as the best answer.