duanguoping2016 2016-04-12 09:25

Can a character span multiple runes in Go?

I read this on this blog

Even with rune slices a single character might span multiple runes, which can happen if you have characters with grave accent, for example. This complicated and ambiguous nature of "characters" is the reason why Go strings are represented as byte sequences.

Is it true? (It seems to be a blog by someone who knows Go.) I tested on my machine, and "è" is 1 rune and 2 bytes. The Go doc also seems to say otherwise.

Have you encountered such characters (in UTF-8)? Can a character span multiple runes in Go?


1 answer

  • dtdt0454 2016-04-12 09:29

    Yes, it can:

    s := "é́́"
    fmt.Println(s, []rune(s))
    

    Output (try it on the Go Playground):

    é́́ [101 769 769 769]
    

    One character, 4 runes. It may be arbitrarily long...

    Example taken from The Go Blog: Text Normalization in Go.

    What is a character?

    As was mentioned in the strings blog post, characters can span multiple runes. For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character. The definition of a character may vary depending on the application. For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by possibly empty sequence of non-starters, that is, runes that do (typically accents). The normalization algorithm processes one character at a time.
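The "starter followed by non-starters" definition quoted above can be sketched with the standard library alone. This is only an approximation, not what the normalization package does: the helper name `splitChars` is made up for illustration, and it assumes that a rune is a non-starter exactly when it is in Unicode category Mn (nonspacing mark), whereas real normalization looks at the canonical combining class.

```go
package main

import (
	"fmt"
	"unicode"
)

// splitChars groups the runes of s into "characters": a starter followed by
// any number of combining marks.
// ASSUMPTION: a rune is treated as a non-starter when it belongs to Unicode
// category Mn. This approximates, but is not identical to, the
// combining-class test used by real Unicode normalization.
func splitChars(s string) []string {
	var out []string
	var cur []rune
	for _, r := range s {
		// A starter closes the character collected so far.
		if !unicode.Is(unicode.Mn, r) && len(cur) > 0 {
			out = append(out, string(cur))
			cur = cur[:0]
		}
		cur = append(cur, r)
	}
	if len(cur) > 0 {
		out = append(out, string(cur))
	}
	return out
}

func main() {
	// 'n', then 'e' + combining acute (U+0301), then a plain 'e'.
	chars := splitChars("ne\u0301e")
	fmt.Println(len(chars)) // 3 characters built from 4 runes
	for _, c := range chars {
		fmt.Printf("%q: %d rune(s)\n", c, len([]rune(c)))
	}
}
```

For anything beyond a demonstration, a proper segmentation or normalization library is the safer choice, since the Mn shortcut misclassifies some runes.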

    A character can be followed by any number of modifiers (modifiers can be repeated and stacked):

    Theoretically, there is no bound to the number of runes that can make up a Unicode character. In fact, there are no restrictions on the number of modifiers that can follow a character and a modifier may be repeated, or stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a perfectly valid 4-rune character according to the standard.
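To see that the rune count really grows with each stacked modifier, here is a small sketch. The helper name `stacked` is hypothetical, and the accents are written as `\u0301` escapes so the result does not depend on how an editor stores the literal.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stacked returns 'e' followed by n combining acute accents (U+0301).
func stacked(n int) string {
	return "e" + strings.Repeat("\u0301", n)
}

func main() {
	for _, n := range []int{0, 1, 3, 10} {
		s := stacked(n)
		// Each accent adds one rune and two UTF-8 bytes.
		fmt.Printf("accents=%d runes=%d bytes=%d\n",
			n, utf8.RuneCountInString(s), len(s))
	}
}
```

With 3 accents this is the same 4-rune string as in the example at the top of the answer.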

    Also see: Combining character.

    Edit: "Doesn't this kill the 'concept of runes'?"

    Answer: Multi-rune characters are not a concept of runes; they are a concept of the Unicode standard. A rune is not a character. A rune is an integer value identifying a Unicode code point. A character may be a single Unicode code point, in which case 1 character is 1 rune. Most everyday use of runes falls into this case, so in practice this rarely causes headaches.
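This also explains the result in the question: "è" typed on a keyboard is usually the single precomposed code point U+00E8, which is 1 rune and 2 bytes. The same visible character can also be written as 'e' plus a combining grave accent. A minimal sketch comparing the two (the helper name `describe` is made up for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length and the rune count of s.
func describe(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e8" // è as a single code point
	decomposed := "e\u0300" // 'e' + COMBINING GRAVE ACCENT

	b, r := describe(precomposed)
	fmt.Println(b, r) // 2 1

	b, r = describe(decomposed)
	fmt.Println(b, r) // 3 2

	// Go compares strings byte by byte, so the two forms are not equal
	// even though they render as the same character.
	fmt.Println(precomposed == decomposed) // false
}
```

If the two forms need to compare equal, the strings have to be normalized to the same form first, which is what the Text Normalization post quoted above is about.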
