dousou2911 2014-08-20 11:28
Accepted

Go: Writing a transformer for code.google.com/p/go.text/transform

For some time I've been normalizing & de-accenting text by doing:

// Local helper function for normalization of UTF8 strings.
func isMn(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

// This map is used by removeAccentsBytesDashes to transliterate characters that have no combining accent to strip.
var transliterations = map[rune]string{'Æ':"E",'Ð':"D",'Ł':"L",'Ø':"OE",'Þ':"Th",'ß':"ss",'æ':"e",'ð':"d",'ł':"l",'ø':"oe",'þ':"th",'Œ':"OE",'œ':"oe"}

// removeAccentsBytesDashes converts accented UTF8 characters in a []byte into their non-accented equivalents and replaces dashes with spaces.
func removeAccentsBytesDashes(b []byte) ([]byte, error) {
    mnBuf := make([]byte, len(b))
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    n, _, err := t.Transform(mnBuf, b, true)
    if err != nil {
        return nil, err
    }
    mnBuf = mnBuf[:n]
    tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*2))
    for i, w := 0, 0; i < len(mnBuf); i += w {
        r, width := utf8.DecodeRune(mnBuf[i:])
        if r=='-' {
            tlBuf.WriteByte(' ')
        } else {
            if d, ok := transliterations[r]; ok {
                tlBuf.WriteString(d)
            } else {
                tlBuf.WriteRune(r)
            }
        }
        w = width
    }
    return tlBuf.Bytes(), nil
}

After that I lowercase the whole thing and apply a series of regular expressions.

This way of doing it is very heavy. I reckon I should be able to do the entire thing in one loop over the bytes instead of about 10 loops, and on top of that the regular expressions are slow.

My first thought was to modify the above function to do the lowercasing directly in the loop (the second part of the removeAccentsBytesDashes function). But then I decided I'd like to combine it all into a single loop, including the transform function.

For this I first tried to get the transformation tables out of the transform source, then tried copying and modifying it, but I can't seem to get it to give me whatever tables it's using for the transformation. It turns out that norm.NFD is just 1 and norm.NFC is just 0, and I have yet to figure out how it takes the fact that the parameters are 0 or 1 and somehow gets a transformation table out of that.

Reading its code I can see it's written efficiently anyway, and it's obviously beyond my beginner's Go skills, so I thought it might be better to use transform.Chain to add in my own transformers.

I can't find any instructions anywhere on how to write a transformer that will be accepted by transform.Chain. Nothing.

Does anyone have any information on how I can make a transformer for this?

1 answer

  • dongyan1491 2014-08-20 18:28

    transform.Chain

    func Chain(t ...Transformer) Transformer
    

    takes a variadic list of transform.Transformer values, where Transformer is an interface:

    type Transformer interface {
        Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
    }
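
This interface is also why norm.NFD and norm.NFC can be passed to Chain even though they print as 1 and 0: norm.Form is an integer type, and its constants carry a Transform method. Below is a minimal sketch of the same pattern. The Case type, its constants, and the locally redeclared Transformer interface are hypothetical stand-ins so that the example builds with the standard library alone; none of them come from the transform or norm packages.

```go
package main

import "fmt"

// Transformer mirrors the interface quoted above. It is redeclared here
// only so the sketch compiles without external packages.
type Transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

// Case is a hypothetical integer type, similar in shape to norm.Form.
type Case int

const (
	Lower Case = iota // Lower == 0
	Upper             // Upper == 1
)

// Transform is declared on the integer type, so every Case constant
// satisfies Transformer. The constant's value selects the behaviour.
// Simplification: ASCII only, and dst is assumed to be at least as long
// as src (a real transformer would report a short buffer instead).
func (c Case) Transform(dst, src []byte, atEOF bool) (int, int, error) {
	n := copy(dst, src)
	for i := 0; i < n; i++ {
		b := dst[i]
		switch {
		case c == Upper && 'a' <= b && b <= 'z':
			dst[i] = b - ('a' - 'A')
		case c == Lower && 'A' <= b && b <= 'Z':
			dst[i] = b + ('a' - 'A')
		}
	}
	return n, n, nil
}

func main() {
	var t Transformer = Upper // an integer constant used as an interface value
	dst := make([]byte, 16)
	n, _, _ := t.Transform(dst, []byte("go text"), true)
	fmt.Println(string(dst[:n])) // prints "GO TEXT"
}
```

So there is no table hidden behind the numbers 0 and 1: the methods of the type decide what to do based on the value of the receiver.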
    

    To plug your own step into Chain, you just need to create a type that implements the Transformer interface:

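    type DenormalizeAndDeaccent struct {
    }

    // Reset is a no-op because the type holds no state. Newer releases of the
    // package (golang.org/x/text/transform) also require it on a Transformer.
    func (t *DenormalizeAndDeaccent) Reset() {}

    // Transform converts src in one go. It ignores atEOF, so it assumes that
    // src holds complete UTF-8 sequences.
    func (t *DenormalizeAndDeaccent) Transform(dst, src []byte, atEOF bool) (int, int, error) {
        result, err := removeAccentsBytesDashes(src)
        if err != nil {
            return 0, 0, err // propagate the error instead of hiding it
        }
        if len(result) > len(dst) {
            // Nothing is written or consumed, so the caller can retry with a larger dst.
            return 0, 0, transform.ErrShortDst
        }
        n := copy(dst, result)
        return n, len(src), nil
    }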
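
The wrapper above converts the whole of src at once. A transformer that sits inside a chain is usually called repeatedly on partial buffers, so here is a hedged sketch of a rune-by-rune version that folds the dash replacement, transliteration and lowercasing into a single loop. The dashLower type, the trimmed translit table, and the errShortDst / errShortSrc values are hypothetical names; the two errors stand in for transform.ErrShortDst and transform.ErrShortSrc so that the sketch builds with the standard library alone.

```go
package main

import (
	"errors"
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Stand-ins for transform.ErrShortDst and transform.ErrShortSrc.
// In real code, return the transform package's own values.
var (
	errShortDst = errors.New("short destination buffer")
	errShortSrc = errors.New("short source buffer")
)

// translit is a hypothetical, trimmed-down table with lowercase output.
var translit = map[rune]string{'ß': "ss", 'Ø': "oe", 'ø': "oe", 'Ł': "l", 'ł': "l"}

// dashLower is stateless: '-' becomes ' ', runes in translit are
// transliterated, and every other rune is lowercased.
type dashLower struct{}

// Reset is a no-op because dashLower keeps nothing between calls.
func (dashLower) Reset() {}

func (dashLower) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	var buf [utf8.UTFMax]byte
	for nSrc < len(src) {
		r, width := utf8.DecodeRune(src[nSrc:])
		if r == utf8.RuneError && width == 1 && !atEOF && !utf8.FullRune(src[nSrc:]) {
			// The rune may be split across two calls: ask for more input.
			return nDst, nSrc, errShortSrc
		}
		var out []byte
		if r == '-' {
			out = append(buf[:0], ' ')
		} else if s, ok := translit[r]; ok {
			out = []byte(s)
		} else {
			// Invalid bytes decode to utf8.RuneError and are written as U+FFFD.
			out = buf[:utf8.EncodeRune(buf[:], unicode.ToLower(r))]
		}
		if nDst+len(out) > len(dst) {
			// Report only what was completely written and consumed so far.
			return nDst, nSrc, errShortDst
		}
		nDst += copy(dst[nDst:], out)
		nSrc += width
	}
	return nDst, nSrc, nil
}

func main() {
	src := []byte("Straße-ŁØ")
	dst := make([]byte, 2*len(src))
	n, _, err := dashLower{}.Transform(dst, src, true)
	fmt.Println(string(dst[:n]), err) // prints "strasse loe <nil>"
}
```

With the two stand-in errors swapped for the transform package's values, a type like this should be usable in something along the lines of transform.Chain(norm.NFD, transform.RemoveFunc(isMn), dashLower{}, norm.NFC), which leaves accent stripping to the existing steps and drops the separate lowercasing pass.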
    
    This answer was accepted by the asker as the best answer.