How do I encode a []rune into a []byte using UTF-8 in Golang?

So it's really easy to decode a []byte into a []rune (simply converting to string and then to []rune works very nicely; I'm assuming it defaults to UTF-8, with filler bytes for invalid input). My question is: how are you supposed to encode this []rune back into a []byte in UTF-8 form?

Am I missing something, or do I have to manually call EncodeRune for every single rune in my []rune? Surely there is an encoder that I can simply pass a Writer to.

duanhuo7441 2015-03-25 12:35
You can simply convert a rune slice ([]rune) to string which you can convert back to []byte.

Example:

    rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
    bs := []byte(string(rs))

    fmt.Printf("%s\n", bs)
    fmt.Println(string(bs))

Output (try it on the Go Playground):

    Hello 世界
    Hello 世界

The Go Specification: Conversions mentions this case explicitly: Conversions to and from a string type, point #3:

Converting a slice of runes to a string type yields a string that is the concatenation of the individual rune values converted to strings.
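As a small illustration of that rule, the sketch below converts a rune slice that contains two invalid values. It assumes the behaviour the specification describes for rune-to-string conversion, where values outside the valid Unicode range (and surrogate halves) come out as "\uFFFD"; the helper name runesToString is made up for this example.

```go
package main

import "fmt"

// runesToString is a hypothetical helper that wraps the built-in conversion.
func runesToString(rs []rune) string {
	return string(rs)
}

func main() {
	// 0xD800 is a surrogate half and 0x110000 lies above the Unicode range,
	// so neither is a valid code point.
	rs := []rune{'A', 0xD800, 0x110000, '世'}
	s := runesToString(rs)

	// Each invalid rune is replaced by U+FFFD, which takes 3 bytes in UTF-8.
	fmt.Println(s == "A\uFFFD\uFFFD世") // true
	fmt.Println(len([]byte(s)))         // 10 (1 + 3 + 3 + 3)
}
```

So the conversion never fails; invalid runes are silently replaced rather than reported.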

Note that the above solution, although it may be the simplest, might not be the most efficient. The reason is that it first creates a string value holding a "copy" of the runes in UTF-8 encoded form, and then copies the string's backing bytes into the result byte slice. The second copy has to be made because string values are immutable: if the result slice shared data with the string, we would be able to modify the content of the string. For details, see "golang: []byte(string) vs []byte(*string)" and "Immutable string and pointer address".

Note that a smart compiler could detect that the intermediate string value cannot be referred to, and thus eliminate one of the copies.

We may get better performance by allocating a single byte slice and encoding the runes one by one into it. To do this easily, we can use the unicode/utf8 package:

    rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
    bs := make([]byte, len(rs)*utf8.UTFMax)

    count := 0
    for _, r := range rs {
        count += utf8.EncodeRune(bs[count:], r)
    }
    bs = bs[:count]

    fmt.Printf("%s\n", bs)
    fmt.Println(string(bs))

Output of the above is the same. Try it on the Go Playground.

Note that in order to create the result slice, we had to guess how big the result slice will be. We used a maximum estimation, which is the number of runes multiplied by the max number of bytes a rune may be encoded to (utf8.UTFMax). In most cases, this will be bigger than needed.
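If guessing the size up front feels awkward, one alternative is to let append grow the slice. The sketch below assumes Go 1.18 or newer, where utf8.AppendRune is available; the function name runesToUTF8Append is hypothetical and this variant is not part of the benchmark in this answer.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesToUTF8Append is a hypothetical helper; it relies on utf8.AppendRune,
// which requires Go 1.18 or newer.
func runesToUTF8Append(rs []rune) []byte {
	// Start with room for one byte per rune (exact for pure ASCII input)
	// and let append grow the slice when multi-byte runes show up.
	bs := make([]byte, 0, len(rs))
	for _, r := range rs {
		bs = utf8.AppendRune(bs, r)
	}
	return bs
}

func main() {
	rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
	bs := runesToUTF8Append(rs)
	fmt.Println(string(bs), len(bs)) // Hello 世界 12
}
```

The trade-off is that append may reallocate once or twice for non-ASCII input, in exchange for never over-allocating four bytes per rune.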

We may create a third version where we first calculate the exact size needed. For this, we may use the utf8.RuneLen() function. The gain will be that we will not "waste" memory, and we won't have to do a final slicing (bs = bs[:count]).
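One caveat with exact sizing is worth keeping in mind: utf8.RuneLen returns -1 for a rune that is not a valid code point, while utf8.EncodeRune writes the 3-byte replacement character for it, so a plain sum of RuneLen values can come out too small. The following is a minimal sketch of a defensive variant under that assumption; the name runesToUTF8Exact is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesToUTF8Exact is a hypothetical variant of the exact-size approach that
// also tolerates invalid runes.
func runesToUTF8Exact(rs []rune) []byte {
	size := 0
	for _, r := range rs {
		n := utf8.RuneLen(r)
		if n < 0 {
			// Invalid rune: EncodeRune will write utf8.RuneError instead,
			// so reserve the space that replacement needs.
			n = utf8.RuneLen(utf8.RuneError)
		}
		size += n
	}

	bs := make([]byte, size)
	count := 0
	for _, r := range rs {
		count += utf8.EncodeRune(bs[count:], r)
	}
	return bs
}

func main() {
	rs := []rune{'a', 0xD800, '界'} // 0xD800 is a surrogate half, not a valid rune
	bs := runesToUTF8Exact(rs)
	fmt.Println(len(bs))                  // 7 (1 + 3 + 3)
	fmt.Println(string(bs) == string(rs)) // true
}
```

If the input is known to contain only valid runes, the extra branch is unnecessary.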

Let's compare the performance of the three versions. The functions to compare:

    // Version 1: convert via an intermediate string.
    func runesToUTF8(rs []rune) []byte {
        return []byte(string(rs))
    }

    // Version 2: allocate for the worst case, then trim.
    func runesToUTF8Manual(rs []rune) []byte {
        bs := make([]byte, len(rs)*utf8.UTFMax)

        count := 0
        for _, r := range rs {
            count += utf8.EncodeRune(bs[count:], r)
        }

        return bs[:count]
    }

    // Version 3: compute the exact size first.
    // Note: utf8.RuneLen returns -1 for invalid runes, so this version
    // assumes rs holds only valid runes.
    func runesToUTF8Manual2(rs []rune) []byte {
        size := 0
        for _, r := range rs {
            size += utf8.RuneLen(r)
        }

        bs := make([]byte, size)

        count := 0
        for _, r := range rs {
            count += utf8.EncodeRune(bs[count:], r)
        }

        return bs
    }

And the benchmarking code:

    var rs = []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}

    func BenchmarkFirst(b *testing.B) {
        for i := 0; i < b.N; i++ {
            runesToUTF8(rs)
        }
    }

    func BenchmarkSecond(b *testing.B) {
        for i := 0; i < b.N; i++ {
            runesToUTF8Manual(rs)
        }
    }

    func BenchmarkThird(b *testing.B) {
        for i := 0; i < b.N; i++ {
            runesToUTF8Manual2(rs)
        }
    }

And the results:

    BenchmarkFirst-4     20000000    95.8 ns/op
    BenchmarkSecond-4    20000000    84.4 ns/op
    BenchmarkThird-4     20000000    81.2 ns/op

As suspected, the second version is faster and the third version is the fastest, although the performance gain is not huge. In general the first, simplest solution is preferred, but if this is in a critical part of your app (and is executed many, many times), the third version might be worth using.
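On the question's point about passing a Writer: the standard library does not appear to offer a single call that takes a []rune and an io.Writer, but bufio.Writer has a WriteRune method that does the UTF-8 encoding per rune. The sketch below wraps that in two hypothetical helpers, writeRunes and encodeViaWriter; it has not been benchmarked against the versions above.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// writeRunes is a hypothetical helper that UTF-8 encodes rs into w.
func writeRunes(w io.Writer, rs []rune) error {
	bw := bufio.NewWriter(w)
	for _, r := range rs {
		if _, err := bw.WriteRune(r); err != nil {
			return err
		}
	}
	// Flush pushes any buffered bytes through to the underlying writer.
	return bw.Flush()
}

// encodeViaWriter is a hypothetical convenience wrapper that collects the
// encoded bytes in an in-memory buffer.
func encodeViaWriter(rs []rune) ([]byte, error) {
	var buf bytes.Buffer
	if err := writeRunes(&buf, rs); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	bs, err := encodeViaWriter([]rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'})
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println(string(bs), len(bs)) // Hello 世界 12
}
```

This fits best when the bytes are headed to a file or network connection anyway; for an in-memory []byte, the conversion-based versions above are simpler.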