For some time I've been normalizing and de-accenting text with the following code:
import (
	"bytes"
	"unicode"
	"unicode/utf8"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

// isMn reports whether r is a nonspacing mark (Unicode category Mn).
func isMn(r rune) bool {
	return unicode.Is(unicode.Mn, r)
}

// transliterations maps characters that NFD does not decompose to ASCII replacements.
var transliterations = map[rune]string{
	'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss",
	'æ': "e", 'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'Œ': "OE", 'œ': "oe",
}

// removeAccentsBytesDashes strips accents from UTF-8 input, replaces '-' with ' ',
// and applies the transliterations map.
func removeAccentsBytesDashes(b []byte) ([]byte, error) {
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
	// transform.Bytes grows its output as needed. The result can, in rare cases,
	// be longer than the input, so a fixed len(b) destination risks transform.ErrShortDst.
	mnBuf, _, err := transform.Bytes(t, b)
	if err != nil {
		return nil, err
	}
	tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*2))
	for i := 0; i < len(mnBuf); {
		r, width := utf8.DecodeRune(mnBuf[i:])
		i += width
		if r == '-' {
			tlBuf.WriteByte(' ')
		} else if d, ok := transliterations[r]; ok {
			tlBuf.WriteString(d)
		} else {
			tlBuf.WriteRune(r)
		}
	}
	return tlBuf.Bytes(), nil
}
After that I lowercase the whole thing and apply a series of regular expressions.
This approach is very heavy: it makes around 10 passes over the data, and the regular expressions are slow. I reckon I should be able to do the entire thing in one loop over the bytes.
My first thought was to modify the above function to do the lowercasing directly in the loop (the second part of the removeAccentsBytesDashes function). But then I decided I'd like to combine it all into a single loop, including the transform function.
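A minimal sketch of that first idea, using only the standard library: the lowercasing, the dash replacement, and the transliteration all happen in one pass. It assumes the input has already had its combining marks removed. The name foldBytes is hypothetical, and the map is the one from above with keys and replacements lowercased so no second pass is needed.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
	"unicode/utf8"
)

// transliterations is the map from the question, reduced to lowercase keys
// with lowercase replacements, because runes are lowercased before lookup.
var transliterations = map[rune]string{
	'æ': "e", 'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'ß': "ss", 'œ': "oe",
}

// foldBytes (hypothetical name) lowercases, maps '-' to ' ', and
// transliterates in a single pass. It does not remove accents itself.
func foldBytes(b []byte) []byte {
	out := bytes.NewBuffer(make([]byte, 0, len(b)*2))
	for i := 0; i < len(b); {
		r, width := utf8.DecodeRune(b[i:])
		i += width
		r = unicode.ToLower(r)
		if r == '-' {
			out.WriteByte(' ')
		} else if d, ok := transliterations[r]; ok {
			out.WriteString(d)
		} else {
			out.WriteRune(r)
		}
	}
	return out.Bytes()
}

func main() {
	fmt.Printf("%s\n", foldBytes([]byte("Smørrebrød-Straße"))) // smoerrebroed strasse
}
```

This removes one pass, but the accent stripping still runs separately before it.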
To that end I first tried to get the transformation tables out of the transform source, and then tried copying and modifying that source, but I can't get it to give me whatever tables it uses for the transformation. It turns out that norm.NFD is just 1 and norm.NFC is just 0, and I have yet to figure out how it takes a parameter of 0 or 1 and somehow gets a transformation table out of it.
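One plausible reading of the 0 and 1, sketched with the standard library only: in golang.org/x/text/unicode/norm, Form is declared as an integer type, NFC and NFD are consecutive constants of that type, and the Transform method is defined on the type itself. The number therefore only selects which behaviour the method applies, and the decomposition data appears to live in generated tables inside the norm package rather than in transform. The names form, formInfo, and infoByForm below are invented for illustration and are not the library's API.

```go
package main

import "fmt"

// form is a hypothetical stand-in for norm.Form: a named integer type.
type form int

const (
	nfc form = iota // 0, like norm.NFC
	nfd             // 1, like norm.NFD
)

// formInfo and infoByForm are invented for this sketch. The value of the
// constant is used only as an index into a package-level table.
type formInfo struct {
	name    string
	compose bool
}

var infoByForm = []formInfo{
	nfc: {name: "NFC", compose: true},
	nfd: {name: "NFD", compose: false},
}

// describe is a method on the integer type. Methods declared this way are
// what let a plain constant satisfy an interface.
func (f form) describe() string {
	info := infoByForm[f]
	return fmt.Sprintf("%s (compose=%t)", info.name, info.compose)
}

func main() {
	fmt.Println(nfc.describe()) // NFC (compose=true)
	fmt.Println(nfd.describe()) // NFD (compose=false)
}
```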
Reading its code I can see it's written efficiently anyway, and it's obviously beyond my beginner's Go skills, so I thought it might be better to use transform.Chain and add in my own transformers.
I can't find any instructions anywhere on how to write a transformer that will be accepted by transform.Chain. Nothing.
Does anyone have any information on how I can make a transformer for this?
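For reference, a rough sketch of what such a transformer might look like. transform.Chain takes values that implement the transform.Transformer interface, which has two methods: Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) and Reset(). To keep the sketch runnable with the standard library alone, the interface and the two sentinel errors are redeclared locally here; those local declarations and the lowerDash type are assumptions made for illustration. In real use you would drop them, import golang.org/x/text/transform, and return transform.ErrShortDst and transform.ErrShortSrc instead.

```go
package main

import (
	"errors"
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Local stand-ins for transform.ErrShortDst and transform.ErrShortSrc.
var (
	errShortDst = errors.New("short destination buffer")
	errShortSrc = errors.New("short source buffer")
)

// transformer repeats the method set of transform.Transformer.
type transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
	Reset()
}

// lowerDash (hypothetical) lowercases every rune and turns '-' into ' '.
type lowerDash struct{}

// Compile-time check that lowerDash has the required methods.
var _ transformer = lowerDash{}

// Reset does nothing because lowerDash keeps no state between calls.
func (lowerDash) Reset() {}

// Transform consumes whole runes from src and writes the result to dst,
// reporting how many bytes were written and read. Invalid bytes come out
// as U+FFFD.
func (lowerDash) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		r, width := utf8.DecodeRune(src[nSrc:])
		if r == utf8.RuneError && width == 1 && !atEOF && !utf8.FullRune(src[nSrc:]) {
			// The rune may be split across chunks: ask for more input.
			return nDst, nSrc, errShortSrc
		}
		if r == '-' {
			r = ' '
		} else {
			r = unicode.ToLower(r)
		}
		if nDst+utf8.RuneLen(r) > len(dst) {
			// Not enough room: ask the caller for a bigger destination.
			return nDst, nSrc, errShortDst
		}
		nDst += utf8.EncodeRune(dst[nDst:], r)
		nSrc += width
	}
	return nDst, nSrc, nil
}

func main() {
	src := []byte("Jean-Luc ÉTÉ")
	dst := make([]byte, 2*len(src))
	n, _, err := lowerDash{}.Transform(dst, src, true)
	fmt.Println(string(dst[:n]), err) // jean luc été <nil>
}
```

With the real errors swapped in, a value like lowerDash{} could be appended to the existing chain, for example transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC, lowerDash{}). The two short-buffer errors matter because the chain feeds data through in chunks and relies on them to know when to supply more input or more output space.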