Lower bound of the second byte in Go's UTF-8 decoding

I was recently reading the Go source code for UTF-8 decoding. When the first byte has the value 224 (0xE0), the decoder maps it to an accept range of [0xA0, 0xBF] for the second byte. https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L81 https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L94

If I understand the UTF-8 spec (https://tools.ietf.org/html/rfc3629) correctly, every continuation byte has a minimum value of 0x80 (binary 1000 0000). Why is the minimum for the second byte higher when the leading byte is 0xE0, i.e. 0xA0 instead of 0x80?

Answered by dongliang2005 on 2017-12-12 10:41
The reason is to prevent so-called overlong sequences. Quoting the RFC:

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.

[...]

A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.
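
To make the attack described in the quote concrete, here is a minimal Go sketch. The function `naiveDecode2` is a hypothetical, deliberately unsafe decoder written only for illustration; it is not part of any library. It shows how the overlong pair C0 AE collapses to '.', while the standard `unicode/utf8` package rejects the same bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// naiveDecode2 is a hypothetical, deliberately unsafe decoder for two-byte
// sequences: it merges the payload bits without checking for overlong forms.
func naiveDecode2(b0, b1 byte) rune {
	return rune(b0&0x1F)<<6 | rune(b1&0x3F)
}

func main() {
	// C0 AE carries the payload 0x2E, so the naive decoder yields '.'.
	fmt.Printf("naive decode of C0 AE: %q\n", naiveDecode2(0xC0, 0xAE))

	// The standard library treats C0 as an invalid leading byte.
	r, size := utf8.DecodeRune([]byte{0xC0, 0xAE})
	fmt.Println("DecodeRune returns RuneError:", r == utf8.RuneError, "size:", size)

	// The disguised "/../" from the RFC is not valid UTF-8 at all.
	fmt.Println("valid:", utf8.Valid([]byte{0x2F, 0xC0, 0xAE, 0x2E, 0x2F}))
}
```

A filter that looks for the bytes 2F 2E 2E 2F would miss the disguised form, which is why a strict decoder has to reject it rather than interpret it.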

Also note the syntax rules in section 4, which explicitly allow only the bytes A0-BF after E0:

UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
         %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
         %xF4 %x80-8F 2( UTF8-tail )
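
The arithmetic behind the A0 lower bound can be checked with a short Go sketch. The helper `payload3` is a hypothetical function introduced here for illustration: it merges the payload bits of a three-byte sequence with no validity checks. With a leading byte of E0 the top four payload bits are zero, so any second byte below A0 yields a code point below U+0800, which already fits in two bytes and is therefore overlong.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// payload3 is a hypothetical helper that merges the payload bits of a
// three-byte sequence without performing any validity checks.
func payload3(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | rune(b1&0x3F)<<6 | rune(b2&0x3F)
}

func main() {
	// Largest value reachable with E0 and a second byte below A0.
	fmt.Printf("E0 9F BF -> U+%04X\n", payload3(0xE0, 0x9F, 0xBF)) // U+07FF
	// Smallest value reachable with E0 and a second byte of A0.
	fmt.Printf("E0 A0 80 -> U+%04X\n", payload3(0xE0, 0xA0, 0x80)) // U+0800

	// The standard decoder rejects the overlong form and accepts the other.
	r, size := utf8.DecodeRune([]byte{0xE0, 0x9F, 0xBF})
	fmt.Println("E0 9F BF is RuneError:", r == utf8.RuneError, "size:", size)
	r, size = utf8.DecodeRune([]byte{0xE0, 0xA0, 0x80})
	fmt.Printf("E0 A0 80 decodes to U+%04X, size: %d\n", r, size)

	// U+07FF has a shorter, canonical two-byte encoding.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, 0x7FF)
	fmt.Printf("U+07FF encodes as % X\n", buf[:n])
}
```

The same reasoning explains the other narrowed ranges in the grammar: F0 requires 90-BF to exclude overlong four-byte forms, ED is capped at 9F to exclude the surrogates, and F4 is capped at 8F to stay at or below U+10FFFF.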