doujing5435 2015-09-10 22:15

Can Go ignore illegal bytes when decoding text?

I'm converting a Go program that decodes email messages. It currently runs iconv to do the actual decoding, which of course has overhead. I would like to use the golang.org/x/text/transform and golang.org/x/net/html/charset packages to do this. Here is working code:

// cs is the charset that the email body is encoded with, pulled from
// the Content-Type declaration.
enc, _ := charset.Lookup(cs) // the second return value (canonical name) is not needed here
if enc == nil {
    log.Fatalf("Can't find %s", cs)
}
// body is the email body we're converting to utf-8
r := transform.NewReader(strings.NewReader(body), enc.NewDecoder())

// result contains the converted-to-utf8 email body
result, err := ioutil.ReadAll(r)

That works great except for when it encounters illegal bytes, which unfortunately is not an uncommon experience when dealing with email in the wild. ioutil.ReadAll() returns an error and all the converted bytes up until the problem. Is there a way to tell the transform package to ignore illegal bytes? Right now, we use the -c flag to iconv to do that. I've gone through the docs for the transform package, and I can't tell if it's possible or not.

UPDATE: Here's a test program that shows the problem (the Go playground doesn't have the charset or transform packages...). The raw text was taken from an actual email. Yep, it's in English, and yep, the charset in the email was set to EUC-KR. I need it to ignore that apostrophe.

package main

import (
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/net/html/charset"
    "golang.org/x/text/transform"
)

func main() {
    raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
    enc, _ := charset.Lookup("euc-kr")
    r := transform.NewReader(strings.NewReader(raw), enc.NewDecoder())
    result, err := ioutil.ReadAll(r)
    if err != nil {
        log.Printf("ReadAll returned %s", err)
    }
    log.Printf("RESULT: '%s'", string(result))
}
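
For reference, the character that trips the decoder is the curly apostrophe U+2019, which the Go source stores as three UTF-8 bytes. The small sketch below (the helper name `hexBytes` is made up for illustration) prints those bytes next to a plain ASCII apostrophe. Presumably the EUC-KR decoder treats 0xE2 as a lead byte and then rejects 0x80 as its trail byte, but that reading is an assumption, not something verified here.

```go
package main

import "fmt"

// hexBytes returns the bytes of s as space-separated hex pairs.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	fmt.Println(hexBytes("\u2019")) // curly apostrophe: e2 80 99
	fmt.Println(hexBytes("'"))      // ASCII apostrophe: 27
}
```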
2 answers

  • dream518518518 2015-09-11 18:02

    Here is the solution I went with. Instead of using a Reader, I allocate the destination buffer by hand and call the Transform() function directly. When Transform() errors out, I check for a short destination buffer, and reallocate if necessary. Otherwise I skip a rune, assuming that it is the illegal character. For completeness, I should also check for a short input buffer, but I do not do so in this example.

    // Needs "unicode/utf8" in addition to the imports used in the question.
    raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
    enc, _ := charset.Lookup("euc-kr")
    src := []byte(raw) // convert once instead of on every iteration
    dst := make([]byte, len(src))
    d := enc.NewDecoder()
    
    var (
        in  int // bytes of src consumed so far
        out int // bytes of dst written so far
    )
    for in < len(src) {
        // Do the transformation
        ndst, nsrc, err := d.Transform(dst[out:], src[in:], true)
        in += nsrc
        out += ndst
        if err == nil {
            // Completed transformation
            break
        }
        if err == transform.ErrShortDst {
            // Our output buffer is too small, so we need to grow it
            log.Printf("Short")
            t := make([]byte, (cap(dst)+1)*2)
            copy(t, dst[:out])
            dst = t
            continue
        }
        // We're here because of at least one illegal character. Skip over the current rune
        // and try again.
        _, width := utf8.DecodeRune(src[in:])
        in += width
    }
    // Only the first out bytes of dst hold decoded text.
    result := dst[:out]
    log.Printf("RESULT: '%s'", result)
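
The loop above can be tried without the external packages. The following is a minimal sketch under stated assumptions: `asciiDecoder`, `errShortDst` and `errInvalid` are hypothetical local stand-ins for the real decoder, `transform.ErrShortDst` and the decoder's error, and `decodeSkipping` is a made-up name. The toy decoder only copies ASCII and fails on any byte >= 0x80. Unlike the code above, it skips a single byte per error instead of a UTF-8 rune.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for transform.ErrShortDst and for a decoder's own error.
var (
	errShortDst = errors.New("short destination buffer")
	errInvalid  = errors.New("invalid byte")
)

// asciiDecoder is a hypothetical toy decoder whose Transform method has the
// same shape as transform.Transformer's. It copies ASCII bytes and fails on
// any byte >= 0x80.
type asciiDecoder struct{}

func (asciiDecoder) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		if src[nSrc] >= 0x80 {
			return nDst, nSrc, errInvalid
		}
		if nDst == len(dst) {
			return nDst, nSrc, errShortDst
		}
		dst[nDst] = src[nSrc]
		nDst++
		nSrc++
	}
	return nDst, nSrc, nil
}

// decodeSkipping grows dst on errShortDst and drops one source byte on any
// other error, then resumes decoding.
func decodeSkipping(d asciiDecoder, src []byte) []byte {
	dst := make([]byte, 4) // deliberately small to exercise the growth path
	in, out := 0, 0
	for in < len(src) {
		ndst, nsrc, err := d.Transform(dst[out:], src[in:], true)
		in += nsrc
		out += ndst
		if err == nil {
			break
		}
		if err == errShortDst {
			grown := make([]byte, (len(dst)+1)*2)
			copy(grown, dst[:out])
			dst = grown
			continue
		}
		in++ // skip the offending byte and try again
	}
	return dst[:out]
}

func main() {
	out := decodeSkipping(asciiDecoder{}, []byte("you\xe2\x80\x99re here"))
	fmt.Printf("%q\n", out) // "youre here"
}
```

It may also be worth checking the version of golang.org/x/text in use: as far as I know, later releases document that decoders write U+FFFD for bytes they cannot transcode instead of returning an error, in which case the skipping loop would no longer be triggered.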
    
    This answer was selected by the asker as the best answer.