Repost: Converting from []byte to string and vice versa

I always seem to be converting strings to []byte and back to string again, over and over. Is there a lot of overhead in this? Is there a better way?

For example, here is a function that accepts a UTF-8 string, normalizes it, removes accents, then converts special characters to their ASCII equivalents:

var transliterations = map[rune]string{'Æ':"AE",'Ð':"D",'Ł':"L",'Ø':"OE",'Þ':"Th",'ß':"ss",'æ':"ae",'ð':"d",'ł':"l",'ø':"oe",'þ':"th",'Œ':"OE",'œ':"oe"}
func RemoveAccents(s string) string {
    b := make([]byte, len(s))
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    _, _, e := t.Transform(b, []byte(s), true)
    if e != nil { panic(e) }
    r := string(b)

    var f bytes.Buffer
    for _, c := range r {
        temp := rune(c)
        if val, ok := transliterations[temp]; ok {
            f.WriteString(val)
        } else {
            f.WriteRune(temp)
        }
    }
    return f.String()
}

So I'm starting with a string because that's what I get, then I'm converting it to a byte slice, then back to a string, then to a byte slice again, then back to a string again. Surely this is unnecessary, but I can't figure out how to avoid it. And does it really have a lot of overhead, or do I not have to worry about slowing things down with excessive conversions?

(Also, if anyone has the time: I've not yet figured out how bytes.Buffer actually works. Would it not be better to initialize a buffer of 2x the size of the string, which is the maximum output size of the return value?)


3 answers

Anonymous user, 2014-07-23 16:39

In Go, strings are immutable, so any change creates a new string. As a general rule, convert from a string to a byte or rune slice once, do all the work on the slice, and convert back to a string once. To avoid reallocations for small, transient buffers, over-allocate to leave a safety margin when you don't know the exact size.
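The rule can be sketched for the transliteration step alone, using only the standard library. This is a minimal sketch, not the answer's method: it assumes strings.Builder is available (it was added in Go 1.10, after this answer was written), and the names translit and Transliterate are hypothetical, with translit being a shortened copy of the table from the question. Ranging over a string decodes runes in place, so no intermediate []byte or []rune copy is needed.

```go
package main

import (
	"fmt"
	"strings"
)

// translit is a hypothetical, shortened copy of the question's
// transliterations table, used only for illustration.
var translit = map[rune]string{'Æ': "AE", 'ß': "ss", 'ø': "oe", 'Þ': "Th"}

// Transliterate replaces mapped runes with their ASCII equivalents.
// It ranges over the string directly, so the input is never copied
// into a []byte or []rune; the builder's buffer is the only working copy.
func Transliterate(s string) string {
	var sb strings.Builder
	sb.Grow(len(s)) // reserve capacity up front; the builder still grows if more is needed
	for _, r := range s {
		if repl, ok := translit[r]; ok {
			sb.WriteString(repl)
		} else {
			sb.WriteRune(r)
		}
	}
	return sb.String() // returns the accumulated string without another copy
}

func main() {
	fmt.Println(Transliterate("test stringß")) // prints "test stringss"
}
```

The accent-removal step is left out here because it depends on the external text package.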

For example,

package main

import (
    "bytes"
    "fmt"
    "unicode"
    "unicode/utf8"

    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

var isMn = func(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

var transliterations = map[rune]string{
    'Æ': "AE", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th",
    'ß': "ss", 'æ': "ae", 'ð': "d", 'ł': "l", 'ø': "oe",
    'þ': "th", 'Œ': "OE", 'œ': "oe",
}

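// RemoveAccents strips nonspacing marks (accents) from UTF-8 input and
// transliterates the remaining special characters to ASCII equivalents.
func RemoveAccents(b []byte) ([]byte, error) {
    // Over-allocate by 25% as a safety margin for the transform output.
    mnBuf := make([]byte, len(b)*125/100)
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    n, _, err := t.Transform(mnBuf, b, true)
    if err != nil {
        return nil, err
    }
    mnBuf = mnBuf[:n] // keep only the bytes actually written
    tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*125/100))
    // Decode one rune at a time and advance by its encoded width.
    for i := 0; i < len(mnBuf); {
        r, width := utf8.DecodeRune(mnBuf[i:])
        if s, ok := transliterations[r]; ok {
            tlBuf.WriteString(s)
        } else {
            tlBuf.WriteRune(r)
        }
        i += width
    }
    return tlBuf.Bytes(), nil
}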

func main() {
    in := "test stringß"
    fmt.Println(in)
    inBytes := []byte(in)
    outBytes, err := RemoveAccents(inBytes)
    if err != nil {
        fmt.Println(err)
        return
    }
    out := string(outBytes)
    fmt.Println(out)
}

Output: