Removing invalid UTF-8 characters from a string

I get this on json.Marshal of a list of strings:

json: invalid UTF-8 in string: "...ole\xc5\"

The reason is obvious, but how can I delete or replace such invalid bytes in Go? I've been reading the docs for the unicode and unicode/utf8 packages, and there seems to be no obvious or quick way to do it.

In Python, for example, there are methods where the invalid characters can be deleted, replaced by a specified character, or handled with a strict setting that raises an exception on invalid chars. How can I do the equivalent thing in Go?

UPDATE: I meant the reason for getting an exception (panic?) - an illegal char in what json.Marshal expects to be a valid UTF-8 string.

(How the illegal byte sequence got into that string is not important; the usual ways - bugs, file corruption, other programs that do not conform to Unicode, etc.)

dougou6213 2013-12-05 14:56
For example,

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "a\xc5z"
        fmt.Printf("%q\n", s)
        if !utf8.ValidString(s) {
            // Rebuild the string rune by rune, skipping invalid bytes.
            v := make([]rune, 0, len(s))
            for i, r := range s {
                if r == utf8.RuneError {
                    // size == 1 means a decoding error; a genuine
                    // U+FFFD in the input is 3 bytes long and is kept.
                    _, size := utf8.DecodeRuneInString(s[i:])
                    if size == 1 {
                        continue
                    }
                }
                v = append(v, r)
            }
            s = string(v)
        }
        fmt.Printf("%q\n", s)
    }

Output:

    "a\xc5z"
    "az"

Unicode Standard

FAQ - UTF-8, UTF-16, UTF-32 & BOM

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.

A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
This answer was accepted by the asker as the best answer.
