dow66098 2018-09-18 23:07
浏览 71
已采纳

字符串切片是否执行基础数据的复制?

I am trying to efficiently count runes from a utf-8 string using the utf8 library. Is this example optimal in that it does not copy the underlying data?
https://golang.org/pkg/unicode/utf8/#example_DecodeRuneInString

func main() {
    str := "Hello, 世界" // let's assume a runtime-provided string
    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%c %v
", r, size)
        str = str[size:] // performs copy?
    }
}

I found StringHeader in the (unsafe) reflect library. Is this the exact structure of a string in Go? If so, it is conceivable that slicing a string merely updates Data or allocates a new StringHeader altogether.

type StringHeader struct {
        Data uintptr
        Len  int
}

Bonus: where can I find the code that performs string slicing so that I could look it up myself? Any of these?
https://golang.org/src/runtime/slice.go
https://golang.org/src/runtime/string.go

This related SO answer suggests that runtime-strings incur a copy when converted from string to []byte.

  • 写回答

1条回答 默认 最新

  • douxia6554 2018-09-18 23:55
    关注

    Slicing Strings

    does slice of string perform copy of underlying data?

    No it does not. See this post by Russ Cox:

    A string is represented in memory as a 2-word structure containing a pointer to the string data and a length. Because the string is immutable, it is safe for multiple strings to share the same storage, so slicing s results in a new 2-word structure with a potentially different pointer and length that still refers to the same byte sequence. This means that slicing can be done without allocation or copying, making string slices as efficient as passing around explicit indexes.

    -- Go Data Structures

    Slices, Performance, and Iterating Over Runes

    A slice is basically three things: a length, a capacity, and a pointer to a location in an underlying array.

    As such, slices themselves are not very large: ints and a pointer (possibly some other small things in implementation detail). So the allocation required to make a copy of a slice is very small, and doesn't depend on the size of the underlying array. And no new allocation is required when you simply update the length, capacity, and pointer location, such as on line 2 of:

    foo := []int{3, 4, 5, 6}
    foo = foo[1:]
    

    Rather, it's when a new underlying array has to be allocated that a performance impact is felt.

    Strings in Go are immutable. So to change a string you need to make a new string. However, strings are closely related to byte slices, e.g. you can create a byte slice from a string with

    foo := `here's my string`
    fooBytes := []byte(foo)
    

    I believe that will allocate a new array of bytes, because:

    a string is in effect a read-only slice of bytes

    according to the Go Blog (see Strings, bytes, runes and characters in Go). In general you can use a slice to change the contents of an underlying array, so to produce a usable byte slice from a string you would have to make a copy to keep the user from changing what's supposed to be immutable.

    You could use performance profiling and benchmarking to gain further insight into the performance of your program.

    Once you have your slice of bytes, fooBytes, reslicing it does not allocate a new array, it just allocates a new slice, which is small. This appears to be what slicing a string does as well.

    Note that you don't need to use the utf8 package to count words in a utf8 string, though you may proceed that way if you like. Go handles utf8 natively. However if you want to iterate over characters you can't represent the string as a slice of bytes, because you could have multibyte characters. Instead you need to represent it as a slice of runes:

    foo := `here's my string`
    fooRunes := []rune(foo)
    

    This operation of converting a string to a slice of runes is fast in my experience (trivial in benchmarks I've done, but there may be an allocation). Now you can iterate across fooRunes to count words, no utf8 package required. Alternatively, you can skip the explicit []rune(foo) conversion and do it implicitly by using a for ... range loop on the string, because those are special:

    A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value.

    -- Strings, bytes, runes and characters in Go

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 pnpm 下载element-plus
  • ¥15 解决编写PyDracula时遇到的问题
  • ¥15 有没有人能解决下这个问题吗,本人不会编程
  • ¥15 plotBAPC画图出错
  • ¥30 关于#opencv#的问题:使用大疆无人机拍摄水稻田间图像,拼接成tif图片,用什么方法可以识别并框选出水稻作物行
  • ¥15 Python卡尔曼滤波融合
  • ¥20 iOS绕地区网络检测
  • ¥15 python验证码滑块图像识别
  • ¥15 根据背景及设计要求撰写设计报告
  • ¥20 能提供一下思路或者代码吗