duanhuihui2705 2018-08-24 14:09

Combining characters (grapheme clusters) and Unicode in the MS Windows console cmd.exe

In the following code, the ü is not the single Unicode character U+00FC but a single grapheme cluster composed of two Unicode characters: the plain ASCII u (U+0075) followed by the combining diaeresis (U+0308).

fmt.Println("Jürgen Džemal")
fmt.Println("Ju\u0308rgen \u01c5emel")

If I run it in the Go Playground, it works as expected.

If I run it in an MS Windows 10 "Command Prompt" window, the combining character is not visually combined with the preceding character. However, when I copy and paste the text into here, it appears correctly:

C:\> ver

Microsoft Windows [Version 10.0.17134.228]

C:\> test
Jürgen Džemal
Jürgen Džemel

On screen, in the "Command Prompt" window it looked more like:

Ju¨rgen Džemel
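To double-check that the string really contains the two separate code points described above, a small sketch like the following can help. It uses only the standard library; `codePoints` is a helper name I made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoints returns every code point in s formatted as "U+XXXX".
// (Hypothetical helper, not part of any library.)
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	s := "Ju\u0308rgen"
	// 7 code points, although only 6 characters are visible on screen.
	fmt.Println(utf8.RuneCountInString(s))
	// The list includes U+0075 (u) followed by U+0308 (combining diaeresis).
	fmt.Println(codePoints(s))
}
```

This only confirms what the program sends; it says nothing about how the console renders it.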

Changing the code page (chcp) from 850 to 65001 made no difference. Changing fonts (Consolas, Courier etc) made no difference.

In the past I have experienced problems that were fundamentally because Microsoft require Windows programs to use a different API to output characters to STDOUT depending on whether STDOUT is attached to a console or to a file. I don't know if this is a different manifestation of the same issue.

Is there something I can do to make this Unicode grapheme-cluster appear correctly?


1 answer

  • duanke3985 2018-08-24 14:52

    As eryksun and Peter commented,

    • The Windows console (conhost.exe) doesn't support combining codes. You'll have to first normalize to an equivalent string that uses precomposed characters.
    • You can use golang.org/x/text/unicode/norm to do the normalization (e.g. norm.NFC.String("Jürgen Džemal")).

    I tried this

    s := "Ju\u0308rgen \u01c5emel"
    fmt.Println(s)              // dieresis not combined with u by conhost.exe
    s = norm.NFC.String(s)
    fmt.Println(s)              // shows correctly
    

    And the output looked like this

    Ju¨rgen Džemel   Jürgen Džemel

    or, for the visually impaired with fabulously sophisticated screen readers - a bit like this:

    Ju¨rgen Džemel
    Jürgen Džemel
    

    Note that Unicode has four different normalised forms but NFC is the most used on the Internet in web-pages and is also appropriate for this situation.
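To make concrete what "composing" means for this one case, here is a deliberately tiny sketch that merges a base letter and a following combining mark using a hand-written table. The table and the `composePairs` function are made up for illustration and cover only u/U plus the diaeresis; this is not how the norm package is implemented, and real code should keep using norm.NFC.

```go
package main

import "fmt"

// precomposed is a tiny illustrative table mapping
// (base rune, combining mark) to a single precomposed rune.
// It is hypothetical and nowhere near complete.
var precomposed = map[[2]rune]rune{
	{'u', '\u0308'}: '\u00fc', // u + combining diaeresis -> ü
	{'U', '\u0308'}: '\u00dc', // U + combining diaeresis -> Ü
}

// composePairs replaces each (base, mark) pair found in the table
// with its precomposed rune and leaves everything else untouched.
func composePairs(s string) string {
	in := []rune(s)
	out := make([]rune, 0, len(in))
	for _, r := range in {
		if n := len(out); n > 0 {
			if c, ok := precomposed[[2]rune{out[n-1], r}]; ok {
				out[n-1] = c
				continue
			}
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	s := "Ju\u0308rgen"
	fmt.Println(len([]rune(s)))               // 7 code points before
	fmt.Println(len([]rune(composePairs(s)))) // 6 code points after
}
```

The point is only that the seven code points become six, with U+00FC in place of the u and U+0308 pair, which is the form conhost.exe can display.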

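    The package also has other methods that may be more efficient or more useful depending on the situation, for example norm.NFC.IsNormalString to check whether a string is already normalized, norm.NFC.Bytes for byte slices, and norm.NFC.Writer for streaming output.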

    I have read that some user-perceived characters can only be represented in Unicode with combining characters; that is, no precomposed character exists for them. Normalizing to NFC does not help with those, so a more thorough approach would be needed to do something appropriate. The complications of Unicode (or perhaps more accurately of human languages and their typography) sometimes seem almost without end.
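One possible first step for such cases is to check whether any combining marks are still present in a string (for instance after normalizing), so the program can at least decide on a fallback. A minimal sketch using only the standard library follows; `hasCombiningMark` is a name I picked, and it only looks at the Unicode category Mn (nonspacing marks).

```go
package main

import (
	"fmt"
	"unicode"
)

// hasCombiningMark reports whether s contains at least one rune
// in the Unicode category Mn (nonspacing mark).
// (Hypothetical helper, not part of any library.)
func hasCombiningMark(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasCombiningMark("J\u00fcrgen")) // false: precomposed ü only
	fmt.Println(hasCombiningMark("q\u0308"))     // true: q followed by U+0308
}
```

What to do when it returns true (drop the mark, substitute something, or warn the user) depends on the application.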
