dongxibeng5324 2017-12-12 09:48
浏览 65
已采纳

utf8 golang中的第二个字节下限

I was recently going through the go source code of utf8 decoding. Apparently when decoding utf8 bytes, when the first byte has the value 224 (0xE0) it maps to an accept range of [0xA0; 0xBF]. https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L81 https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L94

If I understand the utf8 spec (https://tools.ietf.org/html/rfc3629) correctly every continuation byte has the minimum value of 0x80 or 1000 0000. Why is the minimum value for opening byte with 0xE0 higher, i.e. 0xA0 instead of 0x80?

  • 写回答

2条回答 默认 最新

  • dongliang2005 2017-12-12 10:41
    关注

    The reason is to prevent so-called overlong sequences. Quoting the RFC:

    Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.

    [...]

    A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.

    Also note the syntax rules in section 4 which explicitly only allow characters A0-BF after E0:

    UTF8-2      = %xC2-DF UTF8-tail  
    UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                  %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )  
    UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                  %xF4 %x80-8F 2( UTF8-tail )
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

手机看
程序员都在用的中文IT技术交流社区

程序员都在用的中文IT技术交流社区

专业的中文 IT 技术社区,与千万技术人共成长

专业的中文 IT 技术社区,与千万技术人共成长

关注【CSDN】视频号,行业资讯、技术分享精彩不断,直播好礼送不停!

关注【CSDN】视频号,行业资讯、技术分享精彩不断,直播好礼送不停!

客服 返回
顶部