字符串切片是否执行基础数据的复制？

I am trying to efficiently count runes from a utf-8 string using the utf8 library. Is this example optimal in that it does not copy the underlying data?
https://golang.org/pkg/unicode/utf8/#example_DecodeRuneInString

func main() {
    str := "Hello, 世界" // let's assume a runtime-provided string
    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%c %v
", r, size)
        str = str[size:] // performs copy?
    }
}

I found StringHeader in the (unsafe) reflect library. Is this the exact structure of a string in Go? If so, it is conceivable that slicing a string merely updates Data or allocates a new StringHeader altogether.

type StringHeader struct {
        Data uintptr
        Len  int
}

Bonus: where can I find the code that performs string slicing so that I could look it up myself? Any of these?
https://golang.org/src/runtime/slice.go
https://golang.org/src/runtime/string.go

This related SO answer suggests that runtime-strings incur a copy when converted from string to []byte.

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

1条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
douxia6554 2018-09-18 23:55
关注
Slicing Strings

does slice of string perform copy of underlying data?

No it does not. See this post by Russ Cox:

A string is represented in memory as a 2-word structure containing a pointer to the string data and a length. Because the string is immutable, it is safe for multiple strings to share the same storage, so slicing s results in a new 2-word structure with a potentially different pointer and length that still refers to the same byte sequence. This means that slicing can be done without allocation or copying, making string slices as efficient as passing around explicit indexes.

-- Go Data Structures

Slices, Performance, and Iterating Over Runes

A slice is basically three things: a length, a capacity, and a pointer to a location in an underlying array.

As such, slices themselves are not very large: ints and a pointer (possibly some other small things in implementation detail). So the allocation required to make a copy of a slice is very small, and doesn't depend on the size of the underlying array. And no new allocation is required when you simply update the length, capacity, and pointer location, such as on line 2 of:

foo := []int{3, 4, 5, 6} foo = foo[1:]

Rather, it's when a new underlying array has to be allocated that a performance impact is felt.

Strings in Go are immutable. So to change a string you need to make a new string. However, strings are closely related to byte slices, e.g. you can create a byte slice from a string with

foo := `here's my string` fooBytes := []byte(foo)

I believe that will allocate a new array of bytes, because:

a string is in effect a read-only slice of bytes

according to the Go Blog (see Strings, bytes, runes and characters in Go). In general you can use a slice to change the contents of an underlying array, so to produce a usable byte slice from a string you would have to make a copy to keep the user from changing what's supposed to be immutable.

You could use performance profiling and benchmarking to gain further insight into the performance of your program.

Once you have your slice of bytes, fooBytes, reslicing it does not allocate a new array, it just allocates a new slice, which is small. This appears to be what slicing a string does as well.

Note that you don't need to use the utf8 package to count words in a utf8 string, though you may proceed that way if you like. Go handles utf8 natively. However if you want to iterate over characters you can't represent the string as a slice of bytes, because you could have multibyte characters. Instead you need to represent it as a slice of runes:

foo := `here's my string` fooRunes := []rune(foo)

This operation of converting a string to a slice of runes is fast in my experience (trivial in benchmarks I've done, but there may be an allocation). Now you can iterate across fooRunes to count words, no utf8 package required. Alternatively, you can skip the explicit []rune(foo) conversion and do it implicitly by using a for ... range loop on the string, because those are special:

A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value.

-- Strings, bytes, runes and characters in Go
本回答被题主选为最佳回答 , 对您是否有帮助呢?

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

报告相同问题？

关注问题

悬赏问题

¥15 关于#Java#的问题，如何解决？
¥15 加热介质是液体，换热器壳侧导热系数和总的导热系数怎么算
¥15 想问一下树莓派接上显示屏后出现如图所示画面，是什么问题导致的
¥100 嵌入式系统基于PIC16F882和热敏电阻的数字温度计
¥15 cmd cl 0x000007b
¥20 BAPI_PR_CHANGE how to add account assignment information for service line
¥500 火焰左右视图、视差（基于双目相机）
¥100 set_link_state
¥15 虚幻5 UE美术毛发渲染
¥15 CVRP 图论物流运输优化

字符串切片是否执行基础数据的复制？

1条回答 默认 最新

Slicing Strings

Slices, Performance, and Iterating Over Runes

悬赏问题

1条回答默认最新