dpvr49226
2016-04-05 12:25
395 views
Accepted

Why does utf8.ValidString not detect invalid Unicode characters?

From https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points, I learned that the code points U+D800 through U+DFFF are invalid. In decimal, that is 55296 through 57343.

The maximum valid Unicode code point is '\U0010FFFF'. In decimal, that is 1114111.

My code:

package main

import "fmt"
import "unicode/utf8"

func main() {

    fmt.Println("Case 1(Invalid Range)")
    str := fmt.Sprintf("%c", rune(55296+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }

    fmt.Println("Case 2(More than maximum valid range)")
    str = fmt.Sprintf("%c", rune(1114111+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
}

Why doesn't the ValidString function return false for the invalid Unicode characters given as input? I am sure my understanding is wrong; could someone explain?




2 answers

  • douxi2011 2016-04-05 13:17
    Accepted

    The problem is in Sprintf. Because you pass it an invalid code point, Sprintf replaces it with rune(65533) (U+FFFD), the Unicode replacement character that stands in for invalid characters. The resulting string is therefore valid UTF-8.

    The same thing happens with a conversion such as str := string([]rune{ 55297 }), so the substitution takes place whenever an invalid rune is converted to a string, not only in Sprintf. This is not immediately obvious from https://blog.golang.org/strings

    If you want to force your string to contain invalid UTF-8, you can build the first string from raw bytes like this:

    str := string([]byte{237, 159, 193})
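
The point above can be checked with a minimal sketch. The helper names fromRune, encodedValid and rawValid are hypothetical, introduced only for illustration. The bytes 0xED 0xA0 0x81 are an assumption on my part: they are what a naive encoder would emit for U+D801, and they are a different (but also invalid) sequence from the bytes in the answer.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// fromRune converts a rune to a string. For an invalid code point
// the conversion yields the replacement character U+FFFD.
func fromRune(r rune) string {
	return string(r)
}

// encodedValid reports whether the string produced from r is valid UTF-8.
func encodedValid(r rune) bool {
	return utf8.ValidString(fromRune(r))
}

// rawValid reports whether the given bytes, used unchanged as a string,
// are valid UTF-8. No substitution happens on this path.
func rawValid(b []byte) bool {
	return utf8.ValidString(string(b))
}

func main() {
	// Going through a rune: the surrogate is replaced before validation.
	fmt.Println(encodedValid(0xD801)) // true

	// Going through raw bytes: the invalid sequence reaches ValidString.
	fmt.Println(rawValid([]byte{0xED, 0xA0, 0x81})) // false
	fmt.Println(rawValid([]byte{237, 159, 193}))    // false
}
```

Only the byte path keeps the invalid data intact, which is why ValidString returns false there and true for the rune path.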
    
  • douchai7891 2016-04-05 13:17

    You take an invalid value and convert it to a string with Sprintf, which turns it into the error value utf8.RuneError (U+FFFD). You then validate that error value, which is itself a valid Unicode code point.

    package main
    
    import (
        "fmt"
        "unicode/utf8"
    )
    
    func main() {
    
        fmt.Println("Case 1: Invalid Range")
        str := fmt.Sprintf("%c", rune(55296+1))
        fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
        if !utf8.ValidString(str) {
            fmt.Print(str, " is not a valid Unicode")
        } else {
            fmt.Println(str, " is valid unicode character")
        }
    
        fmt.Println("Case 2: More than maximum valid range")
        str = fmt.Sprintf("%c", rune(1114111+1))
        fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
        if !utf8.ValidString(str) {
            fmt.Print(str, " is not a valid Unicode")
        } else {
            fmt.Println(str, " is valid unicode character")
        }
    
    }
    

    Output:

    Case 1: Invalid Range
    "�" EFBFBD 65533 65533
    �  is valid unicode character
    Case 2: More than maximum valid range
    "�" EFBFBD 65533 65533
    �  is valid unicode character
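
If the goal is to reject the code point itself, one option is to test the rune before it is ever formatted. The sketch below uses utf8.ValidRune from the standard library for that; the helper name describe is hypothetical and not part of either answer.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe checks the rune itself, before any conversion to a string
// can replace it with U+FFFD.
func describe(r rune) string {
	if !utf8.ValidRune(r) {
		return fmt.Sprintf("U+%04X is not a valid Unicode code point", r)
	}
	return fmt.Sprintf("U+%04X (%c) is a valid Unicode code point", r, r)
}

func main() {
	// The two values from the question, plus one ordinary character.
	for _, r := range []rune{55296 + 1, 1114111 + 1, 'A'} {
		fmt.Println(describe(r))
	}
	// Prints:
	// U+D801 is not a valid Unicode code point
	// U+110000 is not a valid Unicode code point
	// U+0041 (A) is a valid Unicode code point
}
```

Checking the rune avoids the round trip through a string, so the surrogate and the out-of-range value are both reported as invalid.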
    
