doujing5435 2015-09-10 22:15

Can illegal bytes be ignored when decoding text with Go?

I'm converting a Go program that decodes email messages. It currently runs iconv to do the actual decoding, which of course has overhead. I would like to use the golang.org/x/text/transform and golang.org/x/net/html/charset packages to do this. Here is working code:

// cs is the charset that the email body is encoded with, pulled from
// the Content-Type declaration.
enc, _ := charset.Lookup(cs) // the second return value (canonical name) is unused here
if enc == nil {
    log.Fatalf("Can't find %s", cs)
}
// body is the email body we're converting to utf-8
r := transform.NewReader(strings.NewReader(body), enc.NewDecoder())

// result contains the converted-to-utf8 email body
result, err := ioutil.ReadAll(r)

That works great except for when it encounters illegal bytes, which unfortunately is not an uncommon experience when dealing with email in the wild. ioutil.ReadAll() returns an error and all the converted bytes up until the problem. Is there a way to tell the transform package to ignore illegal bytes? Right now, we use the -c flag to iconv to do that. I've gone through the docs for the transform package, and I can't tell if it's possible or not.

UPDATE: Here's a test program that shows the problem (the Go playground doesn't have the charset or transform packages...). The raw text was taken from an actual email. Yep, it's in English, and yep, the charset in the email was set to EUC-KR. I need it to ignore that apostrophe.

package main

import (
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/net/html/charset"
    "golang.org/x/text/transform"
)

func main() {
    raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
    enc, _ := charset.Lookup("euc-kr")
    r := transform.NewReader(strings.NewReader(raw), enc.NewDecoder())
    result, err := ioutil.ReadAll(r)
    if err != nil {
        log.Printf("ReadAll returned %s", err)
    }
    log.Printf("RESULT: '%s'", string(result))
}

  • dream518518518 2015-09-11 18:02

    Here is the solution I went with. Instead of using a Reader, I allocate the destination buffer by hand and call the Transform() function directly. When Transform() errors out, I check for a short destination buffer, and reallocate if necessary. Otherwise I skip a rune, assuming that it is the illegal character. For completeness, I should also check for a short input buffer, but I do not do so in this example.

    raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
    enc, _ := charset.Lookup("euc-kr")
    dst := make([]byte, len(raw))
    d := enc.NewDecoder()
    
    var (
        in  int
        out int
    )
    for in < len(raw) {
        // Do the transformation
        ndst, nsrc, err := d.Transform(dst[out:], []byte(raw[in:]), true)
        in += nsrc
        out += ndst
        if err == nil {
            // Completed transformation
            break
        }
        if err == transform.ErrShortDst {
            // Our output buffer is too small, so we need to grow it
            log.Printf("Short")
            t := make([]byte, (cap(dst)+1)*2)
            copy(t, dst)
            dst = t
            continue
        }
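        // We're here because of at least one illegal character. Skip over the current rune
        // and try again. (utf8 is the standard "unicode/utf8" package.)
        _, width := utf8.DecodeRuneInString(raw[in:])
        in += width
    }
    // The converted text is dst[:out]; the rest of dst is unused capacity.
    log.Printf("RESULT: '%s'", dst[:out])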
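
The same skip-and-retry loop can be sketched as a standalone program that needs only the standard library. Everything named here is an assumption made for illustration: `transformer` is a local interface that copies the method shape of `transform.Transformer`, `errShortDst` and `errIllegal` stand in for `transform.ErrShortDst` and a decoder's own error, and `asciiOnly` is a toy decoder rather than a real charset. Unlike the snippet above, this sketch skips a single byte instead of a UTF-8 rune, which may suit input that is not UTF-8 to begin with.

```go
package main

import (
	"errors"
	"fmt"
)

// transformer mirrors the method shape of transform.Transformer from
// golang.org/x/text/transform. It is declared locally (an assumption of
// this sketch) so the example builds without external modules.
type transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

// Stand-ins for transform.ErrShortDst and a decoder's "illegal input" error.
var (
	errShortDst = errors.New("short destination buffer")
	errIllegal  = errors.New("illegal byte")
)

// asciiOnly is a hypothetical toy decoder: it copies 7-bit bytes through
// and stops with errIllegal at the first byte >= 0x80.
type asciiOnly struct{}

func (asciiOnly) Transform(dst, src []byte, atEOF bool) (int, int, error) {
	n := 0
	for n < len(src) {
		if src[n] >= 0x80 {
			return n, n, errIllegal
		}
		if n == len(dst) {
			return n, n, errShortDst
		}
		dst[n] = src[n]
		n++
	}
	return n, n, nil
}

// decodeSkipping runs t over src, growing the output buffer on errShortDst
// and dropping one input byte on any other error.
func decodeSkipping(t transformer, src []byte) string {
	dst := make([]byte, 8) // deliberately small so the grow path is exercised
	in, out := 0, 0
	for in < len(src) {
		nDst, nSrc, err := t.Transform(dst[out:], src[in:], true)
		in += nSrc
		out += nDst
		switch {
		case err == nil:
			return string(dst[:out])
		case errors.Is(err, errShortDst):
			// Keep what has been written so far and double the buffer.
			grown := make([]byte, 2*len(dst))
			copy(grown, dst[:out])
			dst = grown
		default:
			in++ // skip the offending byte and retry
		}
	}
	return string(dst[:out])
}

func main() {
	// The curly apostrophe is the three UTF-8 bytes E2 80 99.
	raw := []byte("you\xe2\x80\x99re getting 8 kilobytes")
	fmt.Println(decodeSkipping(asciiOnly{}, raw)) // prints "youre getting 8 kilobytes"
}
```

With a real decoder from `charset.Lookup`, the loop body would stay the same; only the error values compared against would change.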
    
    This answer was selected by the asker as the best answer.