Go: making a transformer for code.google.com/p/go.text/transform

For some time I've been normalizing & de-accenting text by doing:

// Local helper function for normalization of UTF8 strings.
func isMn(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

// This map is used by removeAccentsBytesDashes to transliterate characters that NFD decomposition leaves unchanged.
var transliterations = map[rune]string{
    'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'Œ': "OE",
    'æ': "e", 'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'œ': "oe",
}

// removeAccentsBytesDashes converts accented UTF8 characters in a []byte into their non-accented equivalents and replaces dashes with spaces.
func removeAccentsBytesDashes(b []byte) ([]byte, error) {
    mnBuf := make([]byte, len(b))
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    n, _, err := t.Transform(mnBuf, b, true)
    if err != nil {
        return nil, err
    }
    mnBuf = mnBuf[:n]
    tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*2))
    for i, w := 0, 0; i < len(mnBuf); i += w {
        r, width := utf8.DecodeRune(mnBuf[i:])
        if r=='-' {
            tlBuf.WriteByte(' ')
        } else {
            if d, ok := transliterations[r]; ok {
                tlBuf.WriteString(d)
            } else {
                tlBuf.WriteRune(r)
            }
        }
        w = width
    }
    return tlBuf.Bytes(), nil
}

After that I lowercase the whole thing and apply a series of regular expressions.

This way of doing it is very heavy. I reckon I should be able to do the entire thing in one loop over the bytes instead of about 10 loops, and the regular expressions are slow on top of that.

My first thought was to modify the above function to do the lowercasing directly in the loop (the second part of the removeAccentsBytesDashes function). But then I decided I'd like to combine it all into a single loop, including the transform function.
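A minimal sketch of that first idea, with the lowercasing folded into the second loop so that dash replacement, transliteration and lowercasing happen in one pass. foldBytes is a hypothetical name, the map holds only a few of the entries from the question, and the input is assumed to have already gone through the NFD / RemoveFunc / NFC chain.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// A few entries copied from the transliterations map in the question.
var transliterations = map[rune]string{'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'ø': "oe"}

// foldBytes (hypothetical helper) replaces dashes with spaces, applies the
// transliteration map and lowercases everything else, all in a single loop.
func foldBytes(b []byte) []byte {
	out := bytes.NewBuffer(make([]byte, 0, len(b)*2))
	for i := 0; i < len(b); {
		r, width := utf8.DecodeRune(b[i:])
		i += width
		if r == '-' {
			out.WriteByte(' ')
		} else if d, ok := transliterations[r]; ok {
			out.WriteString(strings.ToLower(d))
		} else {
			out.WriteRune(unicode.ToLower(r))
		}
	}
	return out.Bytes()
}

func main() {
	fmt.Printf("%s\n", foldBytes([]byte("Smørre-BRØD Þor")))
}
```

This only removes the separate lowercasing pass; the normalization chain still runs as its own step before it.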

For this I first tried to get the transformation tables out of the transform source, and then tried copying and modifying it, but I can't get it to give me whatever tables it uses for the transformation. It turns out that norm.NFD is just 1 and norm.NFC is 0, and I have yet to figure out how it takes a parameter of 0 or 1 and somehow gets a transformation table out of it.
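On the 0 and 1: norm.Form is an integer type with methods, so norm.NFC and norm.NFD are ordinary integer constants that happen to satisfy the Transformer interface. The sketch below illustrates only that mechanism. The Form type, the Upper/Lower constants, the apply helper and the local Transformer interface are stand-ins made up for this example; the real norm package selects Unicode normalization data internally rather than doing ASCII case changes.

```go
package main

import (
	"errors"
	"fmt"
)

// Transformer is a local stand-in with the same method as transform.Transformer.
type Transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

var errShortDst = errors.New("short destination buffer")

// Form imitates the shape of norm.Form: a plain integer type with named constants.
type Form int

const (
	Upper Form = iota // 0
	Lower             // 1
)

// Transform picks its behaviour from the integer value of f (ASCII only).
func (f Form) Transform(dst, src []byte, atEOF bool) (int, int, error) {
	n := copy(dst, src)
	for i := 0; i < n; i++ {
		c := dst[i]
		switch {
		case f == Upper && 'a' <= c && c <= 'z':
			dst[i] = c - 'a' + 'A'
		case f == Lower && 'A' <= c && c <= 'Z':
			dst[i] = c - 'A' + 'a'
		}
	}
	if n < len(src) {
		return n, n, errShortDst
	}
	return n, n, nil
}

// apply runs a Transformer once over s with a destination of the same size.
func apply(t Transformer, s string) string {
	dst := make([]byte, len(s))
	n, _, _ := t.Transform(dst, []byte(s), true)
	return string(dst[:n])
}

func main() {
	// Lower is just the integer 1, yet it can be passed wherever a Transformer is expected.
	fmt.Println(int(Lower), apply(Lower, "Hello"), apply(Upper, "Hello"))
}
```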

Reading its code I can see it's written efficiently anyway, and obviously beyond my beginner's Go skills, so I thought it might be better to use transform.Chain to add in my own transformers.

I can't find any instructions anywhere on how to write a transformer that will be accepted by transform.Chain. Nothing.

Does anyone have any information on how I can make a transformer for this?

dongyan1491 2014-08-20 18:28
transform.Chain

func Chain(t ...Transformer) Transformer

takes a variadic list of transform.Transformer values

type Transformer interface {
    Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

so you just need to create a type that implements the Transformer interface:

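type DenormalizeAndDeaccent struct{}

// Transform runs removeAccentsBytesDashes over src and copies the result into dst.
func (t *DenormalizeAndDeaccent) Transform(dst, src []byte, atEOF bool) (int, int, error) {
    result, err := removeAccentsBytesDashes(src)
    if err != nil {
        return 0, 0, err
    }
    if len(result) > len(dst) {
        // Consume nothing, so the caller can retry with a larger dst.
        return 0, 0, transform.ErrShortDst
    }
    n := copy(dst, result)
    return n, len(src), nil
}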
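A transformer can also work rune by rune and report how far it got, which avoids allocating a full result for every call. The following is a sketch under stated assumptions, not the package's own code: the Transformer interface, ErrShortDst and ErrShortSrc are local stand-ins for the ones in the transform package so that the example compiles on its own, and foldTransformer and apply are hypothetical names. Note that in the current golang.org/x/text/transform the interface also requires a Reset method, which can be satisfied by embedding transform.NopResetter.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Local stand-ins for transform.ErrShortDst and transform.ErrShortSrc.
var (
	ErrShortDst = errors.New("short destination buffer")
	ErrShortSrc = errors.New("short source buffer")
)

// Transformer is a local stand-in for the interface quoted in the answer.
type Transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

// A few entries copied from the transliterations map in the question.
var transliterations = map[rune]string{'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'ø': "oe"}

// foldTransformer (hypothetical) maps '-' to ' ', transliterates and lowercases.
type foldTransformer struct{}

func (foldTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		r, width := utf8.DecodeRune(src[nSrc:])
		if r == utf8.RuneError && width == 1 && !atEOF && !utf8.FullRune(src[nSrc:]) {
			// The chunk ends in the middle of a rune: ask the caller for more input.
			return nDst, nSrc, ErrShortSrc
		}
		var out string
		if r == '-' {
			out = " "
		} else if d, ok := transliterations[r]; ok {
			out = strings.ToLower(d)
		} else {
			out = string(unicode.ToLower(r))
		}
		if nDst+len(out) > len(dst) {
			// Report what was written so far; the caller drains dst and calls again.
			return nDst, nSrc, ErrShortDst
		}
		nDst += copy(dst[nDst:], out)
		nSrc += width
	}
	return nDst, nSrc, nil
}

// apply drives a Transformer with a fixed-size dst, draining it on ErrShortDst.
func apply(t Transformer, src []byte, dstSize int) (string, error) {
	var out []byte
	dst := make([]byte, dstSize)
	for {
		nDst, nSrc, err := t.Transform(dst, src, true)
		out = append(out, dst[:nDst]...)
		src = src[nSrc:]
		if err == ErrShortDst && (nDst > 0 || nSrc > 0) {
			continue // progress was made, so try again with an empty dst
		}
		return string(out), err
	}
}

func main() {
	out, err := apply(foldTransformer{}, []byte("Smørre-BRØD"), 4)
	fmt.Println(out, err)
}
```

Because it returns the number of bytes written and consumed on every call, a type like this can sit after norm.NFC in a chain and be fed input in chunks.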
This answer was accepted by the asker.
