Can illegal bytes be ignored when decoding text with Go?

I'm converting a Go program that decodes email messages. It currently runs iconv to do the actual decoding, which of course has overhead. I would like to use the golang.org/x/text/transform and golang.org/x/net/html/charset packages to do this. Here is working code:

// cs is the charset that the email body is encoded with, pulled from
// the Content-Type declaration.
enc, name := charset.Lookup(cs)
if enc == nil {
    log.Fatalf("Can't find %s", cs)
}
// body is the email body we're converting to utf-8
r := transform.NewReader(strings.NewReader(body), enc.NewDecoder())

// result contains the converted-to-utf8 email body
result, err := ioutil.ReadAll(r)

That works great except for when it encounters illegal bytes, which unfortunately is not an uncommon experience when dealing with email in the wild. ioutil.ReadAll() returns an error and all the converted bytes up until the problem. Is there a way to tell the transform package to ignore illegal bytes? Right now, we use the -c flag to iconv to do that. I've gone through the docs for the transform package, and I can't tell if it's possible or not.

UPDATE: Here's a test program that shows the problem (the Go playground doesn't have the charset or transform packages...). The raw text was taken from an actual email. Yep, it's in English, and yep, the charset in the email was set to EUC-KR. I need it to ignore that apostrophe.

package main

import (
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/net/html/charset"
    "golang.org/x/text/transform"
)

func main() {
    raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
    enc, _ := charset.Lookup("euc-kr")
    r := transform.NewReader(strings.NewReader(raw), enc.NewDecoder())
    result, err := ioutil.ReadAll(r)
    if err != nil {
        log.Printf("ReadAll returned %s", err)
    }
    log.Printf("RESULT: '%s'", string(result))
}
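
One detail of the test program can be checked with the standard library alone: raw is a Go string literal, and Go source files are UTF-8, so the curly apostrophe reaches the decoder as a multi-byte UTF-8 sequence rather than as EUC-KR. A minimal sketch for inspecting those bytes follows; hexBytes is a helper name made up for this illustration.

```go
package main

import "fmt"

// hexBytes is a hypothetical helper (not part of any library) that renders
// the bytes of s as space-separated hexadecimal pairs.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	// The curly apostrophe (U+2019) from the test string.
	fmt.Println(hexBytes("’")) // prints "e2 80 99"
}
```

These three bytes are what the decoder is asked to interpret as EUC-KR in the test program.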

2 answers

dream518518518 2015-09-11 18:02

Here is the solution I went with. Instead of using a Reader, I allocate the destination buffer by hand and call the Transform() function directly. When Transform() errors out, I check for a short destination buffer, and reallocate if necessary. Otherwise I skip a rune, assuming that it is the illegal character. For completeness, I should also check for a short input buffer, but I do not do so in this example.

raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
enc, _ := charset.Lookup("euc-kr")
dst := make([]byte, len(raw))
d := enc.NewDecoder()

var (
    in  int
    out int
)
for in < len(raw) {
    // Do the transformation
    ndst, nsrc, err := d.Transform(dst[out:], []byte(raw[in:]), true)
    in += nsrc
    out += ndst
    if err == nil {
        // Completed transformation
        break
    }
    if err == transform.ErrShortDst {
        // Our output buffer is too small, so we need to grow it
        log.Printf("Short")
        t := make([]byte, (cap(dst)+1)*2)
        copy(t, dst)
        dst = t
        continue
    }
    // Any other error means the decoder stopped at input it could not decode.
    // Skip over the current rune (needs the unicode/utf8 import) and try again.
    _, width := utf8.DecodeRuneInString(raw[in:])
    in += width
}
// dst[:out] now holds the decoded UTF-8 text.
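
The control flow of that loop can be tried in isolation with the standard library only. The sketch below is an illustration under stated assumptions, not the x/text API: toyTransform, errShortDst and errIllegal are made-up stand-ins for a real decoder and its errors. The toy decoder copies ASCII and stops at any non-ASCII byte, so the grow-and-skip logic can be exercised without external packages.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// Hypothetical stand-ins for the errors a real decoder might return.
var (
	errShortDst = errors.New("short destination buffer")
	errIllegal  = errors.New("illegal byte")
)

// toyTransform is a made-up decoder for illustration only. It copies ASCII
// bytes from src to dst, stops with errIllegal at the first non-ASCII byte,
// and stops with errShortDst when dst is full.
func toyTransform(dst, src []byte) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		if src[nSrc] >= utf8.RuneSelf {
			return nDst, nSrc, errIllegal
		}
		if nDst == len(dst) {
			return nDst, nSrc, errShortDst
		}
		dst[nDst] = src[nSrc]
		nDst++
		nSrc++
	}
	return nDst, nSrc, nil
}

// decodeSkippingIllegal calls toyTransform in a loop, growing dst on
// errShortDst and skipping one rune of input on any other error.
func decodeSkippingIllegal(raw string) string {
	src := []byte(raw)
	dst := make([]byte, 8) // deliberately small so the grow path runs
	in, out := 0, 0
	for in < len(src) {
		nDst, nSrc, err := toyTransform(dst[out:], src[in:])
		in += nSrc
		out += nDst
		if err == nil {
			break
		}
		if errors.Is(err, errShortDst) {
			grown := make([]byte, (len(dst)+1)*2)
			copy(grown, dst[:out])
			dst = grown
			continue
		}
		// Assume the decoder stopped at an undecodable rune and step past it.
		_, width := utf8.DecodeRune(src[in:])
		in += width
	}
	return string(dst[:out])
}

func main() {
	fmt.Println(decodeSkippingIllegal("you’re getting 8 kilobytes a second."))
	// prints "youre getting 8 kilobytes a second."
}
```

Because utf8.DecodeRune reports a width of 1 for invalid UTF-8, the loop always advances past the offending input, whether it is a whole rune or a single stray byte.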

This answer was selected by the asker as the best answer.

Question asked: 2015-09-10 22:15