dongtie0929 2016-01-30 02:30
Views: 221

Decoding email subject headers in different charsets (e.g. ISO-2022-JP, GB-2312)

I am working on a project that needs to handle email encoding/decoding in different charsets. The Python code for this is shown below:

from email.header import Header, decode_header, make_header
from charset import text_to_utf8    

class ....
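    def decode_header(self, header):
        # email.header.decode_header returns a list of (bytes, charset) pairs;
        # only the first pair is used here.
        decoded_header = decode_header(header)
        text, charset = decoded_header[0]

        if charset is None:
            # No charset declared: fall back to the project's own helper.
            return text_to_utf8(text).decode("utf-8", "replace")
        # Map "windows-NNNN" labels to Python's "cpNNNN" codec names.
        return text.decode(charset.replace("windows-", "cp"), "replace")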

Basically, for text like "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA=?=", the "decode_header" function just finds the encoding, 'iso-2022-jp', and then the "decode" method converts the bytes from that charset to Unicode.

Now, in Go, I can do something similar:

import "mime"

dec := new(mime.WordDecoder)
text := "=?utf-8?q?=C3=89ric?= <eric@example.org>, =?utf-8?q?Ana=C3=AFs?= <anais@example.org>"
header, err := dec.DecodeHeader(text)
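For completeness, a self-contained version of that snippet might look like the sketch below. The wrapper name decodeDefault is an assumption made for illustration. It also feeds the ISO-2022-JP example through a decoder with no CharsetReader set, which is expected to return an error because only utf-8, iso-8859-1 and us-ascii are handled by default.

```go
package main

import (
	"fmt"
	"mime"
)

// decodeDefault is a hypothetical wrapper: it decodes a header with a
// zero-value WordDecoder, i.e. without any CharsetReader hook.
func decodeDefault(header string) (string, error) {
	dec := new(mime.WordDecoder)
	return dec.DecodeHeader(header)
}

func main() {
	text := "=?utf-8?q?=C3=89ric?= <eric@example.org>, =?utf-8?q?Ana=C3=AFs?= <anais@example.org>"
	header, err := decodeDefault(text)
	if err != nil {
		panic(err)
	}
	fmt.Println(header) // Éric <eric@example.org>, Anaïs <anais@example.org>

	// iso-2022-jp is not one of the built-in charsets, so this should fail.
	_, err = decodeDefault("=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA=?=")
	fmt.Println("error for iso-2022-jp:", err)
}
```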

It seems that mime.WordDecoder lets you plug in a charset decoder "hook":
type WordDecoder struct {
   // CharsetReader, if non-nil, defines a function to generate
   // charset-conversion readers, converting from the provided
   // charset into UTF-8.
   // Charsets are always lower-case. utf-8, iso-8859-1 and us-ascii charsets
   // are handled by default.
   // One of the CharsetReader's result values must be non-nil.
   CharsetReader func(charset string, input io.Reader) (io.Reader, error)
}
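As a rough sketch of how that hook can be driven by a lookup table instead of a switch, the example below registers one hand-written decoder using only the standard library. Everything named here (latin9Overrides, latin9Reader, charsetReaders, newDecoder) is hypothetical and chosen for illustration; ISO-8859-15 is used only because it differs from ISO-8859-1 in just eight positions, so the table stays small. For charsets such as ISO-2022-JP or GB-2312 the map entries would have to wrap a real conversion library.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
)

// latin9Overrides holds the eight bytes where ISO-8859-15 differs from
// ISO-8859-1; every other byte maps to the code point of the same value.
var latin9Overrides = map[byte]rune{
	0xA4: '€', 0xA6: 'Š', 0xA8: 'š', 0xB4: 'Ž',
	0xB8: 'ž', 0xBC: 'Œ', 0xBD: 'œ', 0xBE: 'Ÿ',
}

// latin9Reader reads all ISO-8859-15 input and returns a reader of UTF-8.
func latin9Reader(input io.Reader) (io.Reader, error) {
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	var out bytes.Buffer
	for _, b := range raw {
		if r, ok := latin9Overrides[b]; ok {
			out.WriteRune(r)
		} else {
			out.WriteRune(rune(b))
		}
	}
	return &out, nil
}

// charsetReaders is the lookup table that replaces the switch statement.
// Keys are lower-case because WordDecoder lower-cases the charset name.
var charsetReaders = map[string]func(io.Reader) (io.Reader, error){
	"iso-8859-15": latin9Reader,
}

// newDecoder returns a WordDecoder whose hook consults the table.
func newDecoder() *mime.WordDecoder {
	return &mime.WordDecoder{
		CharsetReader: func(charset string, input io.Reader) (io.Reader, error) {
			if f, ok := charsetReaders[charset]; ok {
				return f(input)
			}
			return nil, fmt.Errorf("unsupported charset %q", charset)
		},
	}
}

func main() {
	out, err := newDecoder().DecodeHeader("=?ISO-8859-15?Q?100_=A4?=")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // 100 €
}
```

The built-in charsets (utf-8, iso-8859-1, us-ascii) never reach the hook, so the table only needs entries for the extra ones.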

Is there a library that can convert from an arbitrary charset, like the "decode" function in the Python example above? I don't want to write a big "switch-case" like the one used in mime/encodedword.go:

func (d *WordDecoder) convert(buf *bytes.Buffer, charset string, content []byte) error {
   switch {
   case strings.EqualFold("utf-8", charset):
      buf.Write(content)
   case strings.EqualFold("iso-8859-1", charset):
      for _, c := range content {
         buf.WriteRune(rune(c))
      }
....

Any help would be much appreciated.

Thanks.

2 answers

  • dqa35710 2016-01-30 08:28

    I'm not sure this is what you are looking for, but there is the golang.org/x/text package, which I use to convert Windows-1251 to UTF-8. The code looks like this:

    import (
        "golang.org/x/text/encoding/charmap"
        "golang.org/x/text/transform"
        "io/ioutil"
        "strings"
    )
    
    func convert(s string) string {
        sr := strings.NewReader(s)
        tr := transform.NewReader(sr, charmap.Windows1251.NewDecoder())
        buf, err := ioutil.ReadAll(tr)
        if err != nil {
            return ""
        }
        return string(buf)
    }
    

    In your case, if you want to avoid "a big 'switch-case'", I think you can build a map of the available encodings and do something like this:

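    // Also needs "fmt" and "golang.org/x/text/encoding" in the import list.
    // The map stores encodings, not decoder instances: decoders keep state,
    // so a fresh one is created for every call.
    var encodings = map[string]encoding.Encoding{
        "windows-1251": charmap.Windows1251, // lower-case name as seen in headers
    }

    func convert(s, charset string) (string, error) {
        enc, ok := encodings[charset]
        if !ok {
            // An unknown charset would otherwise pass a nil Transformer.
            return "", fmt.Errorf("unsupported charset %q", charset)
        }
        buf, err := ioutil.ReadAll(transform.NewReader(strings.NewReader(s), enc.NewDecoder()))
        if err != nil {
            return "", err
        }
        return string(buf), nil
    }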
    