I'm converting a Go program that decodes email messages. It currently runs iconv to do the actual decoding, which of course has overhead. I would like to use the golang.org/x/text/transform and golang.org/x/net/html/charset packages to do this. Here is working code:
// cs is the charset that the email body is encoded with, pulled from
// the Content-Type declaration.
enc, name := charset.Lookup(cs)
if enc == nil {
	log.Fatalf("Can't find %s", cs)
}
// body is the email body we're converting to utf-8
r := transform.NewReader(strings.NewReader(body), enc.NewDecoder())
// result contains the converted-to-utf8 email body
result, err := ioutil.ReadAll(r)
That works great until it encounters illegal bytes, which unfortunately is common when dealing with email in the wild. In that case ioutil.ReadAll() returns an error along with all the bytes converted up to the problem. Is there a way to tell the transform package to ignore illegal bytes? Right now, we use the -c flag to iconv to do that. I've gone through the docs for the transform package, and I can't tell whether it's possible.
UPDATE: Here's a test program that shows the problem (the Go playground doesn't have the charset or transform packages...). The raw text was taken from an actual email. Yep, it's in English, and yep, the charset in the email was set to EUC-KR. I need it to ignore that apostrophe.
package main

import (
	"io/ioutil"
	"log"
	"strings"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/transform"
)

func main() {
	raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
	enc, _ := charset.Lookup("euc-kr")
	r := transform.NewReader(strings.NewReader(raw), enc.NewDecoder())
	result, err := ioutil.ReadAll(r)
	if err != nil {
		log.Printf("ReadAll returned %s", err)
	}
	log.Printf("RESULT: '%s'", string(result))
}
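
For reference, the behavior being asked for (what iconv -c does) can be sketched with the standard library alone. This is only an illustration of the skip-and-resume pattern, not the transform API: decodeStrict, errInvalid, and decodeSkippingInvalid are hypothetical names, and the toy decoder accepts plain ASCII rather than EUC-KR.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalid is a hypothetical stand-in for the error a strict decoder
// reports when it reaches bytes it cannot decode.
var errInvalid = errors.New("invalid byte")

// decodeStrict is a toy decoder standing in for a real charset decoder.
// It accepts only 7-bit ASCII and stops at the first byte >= 0x80,
// returning what it decoded so far, how many input bytes it consumed,
// and an error.
func decodeStrict(src []byte) (out []byte, consumed int, err error) {
	for i, b := range src {
		if b >= 0x80 {
			return src[:i], i, errInvalid
		}
	}
	return src, len(src), nil
}

// decodeSkippingInvalid keeps the output produced before each error, drops
// the single offending byte, and resumes decoding right after it.
func decodeSkippingInvalid(src []byte) (result []byte, skipped int) {
	for len(src) > 0 {
		out, n, err := decodeStrict(src)
		result = append(result, out...)
		src = src[n:]
		if err != nil {
			src = src[1:] // drop one bad byte, then retry
			skipped++
		}
	}
	return result, skipped
}

func main() {
	// The UTF-8 apostrophe from the test program, written as raw bytes.
	raw := []byte("you\xe2\x80\x99re getting 8 kilobytes")
	result, skipped := decodeSkippingInvalid(raw)
	fmt.Printf("%q (skipped %d bytes)\n", result, skipped)
}
```

Whether a decoder from the charset package can be wrapped so that transform.NewReader behaves this way is exactly what the question is asking.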