douchun1859 2015-05-31 02:56

String to UCS-2

I want to port my Python program to Go. It converts a Unicode string to a UCS-2 hex string.

In Python, it's quite simple:

u"Bien joué".encode('utf-16-be').encode('hex')
-> 004200690065006e0020006a006f007500e9

I am a beginner in Go and the simplest way I found is:

package main

import (
    "fmt"
    "strings"
)

func main() {
    str := "Bien joué" 
    fmt.Printf("str: %s\n", str)

    ucs2HexArray := []rune(str)
    s := fmt.Sprintf("%U", ucs2HexArray)
    a := strings.Replace(s, "U+", "", -1)
    b := strings.Replace(a, "[", "", -1)
    c := strings.Replace(b, "]", "", -1)
    d := strings.Replace(c, " ", "", -1)
    fmt.Printf("->: %s", d)
}

str: Bien joué
->: 004200690065006E0020006A006F007500E9
Program exited.

I don't think this is efficient. How can I improve it?

Thank you

3 answers

  • douzhulan1815 2015-05-31 05:24

    Make this conversion a function; then you can easily improve the conversion algorithm in the future. For example:

    package main
    
    import (
        "fmt"
        "strings"
        "unicode/utf16"
    )
    
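    // hexUTF16FromString returns s as a hex string of big-endian UTF-16 code units.
    func hexUTF16FromString(s string) string {
        // %04x formats the []uint16 as "[0042 0069 ...]", four hex digits per unit.
        hex := fmt.Sprintf("%04x", utf16.Encode([]rune(s)))
        // Strip the surrounding brackets, then the separating spaces.
        return strings.Replace(hex[1:len(hex)-1], " ", "", -1)
    }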
    
    func main() {
        str := "Bien joué"
        fmt.Println(str)
        hex := hexUTF16FromString(str)
        fmt.Println(hex)
    }
    

    Output:

    Bien joué
    004200690065006e0020006a006f007500e9
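
Since the conversion is now isolated in a function, its body can later be swapped for something that avoids formatting a slice and then stripping characters from the result. The following is a minimal sketch of one such variant, not part of the original answer: it writes each UTF-16 code unit as two big-endian bytes and hex-encodes the buffer with the standard `encoding/hex` package. The function name `hexUTF16BE` is hypothetical.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"unicode/utf16"
)

// hexUTF16BE returns the lowercase hex form of s encoded as big-endian UTF-16.
// The name is hypothetical; it does not appear in the original answer.
func hexUTF16BE(s string) string {
	// UTF-16 code units; supplementary characters become surrogate pairs.
	units := utf16.Encode([]rune(s))

	// Two bytes per code unit, high byte first.
	buf := make([]byte, 2*len(units))
	for i, u := range units {
		binary.BigEndian.PutUint16(buf[2*i:], u)
	}
	return hex.EncodeToString(buf)
}

func main() {
	fmt.Println(hexUTF16BE("Bien joué")) // prints 004200690065006e0020006a006f007500e9
}
```

Whether this is measurably faster for short strings is untested here; the main gain is that the output no longer depends on how `fmt` renders a slice.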
    

    NOTE:

    You say "convert an unicode string to a UCS-2 string" but your Python example uses UTF-16:

    u"Bien joué".encode('utf-16-be').encode('hex')
    

    The Unicode Consortium

    UTF-16 FAQ

    Q: What is the difference between UCS-2 and UTF-16?

    A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

    UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.

    Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.
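
To make the distinction in the FAQ concrete, here is a small sketch (an illustration added here, not taken from the FAQ or the answer). It shows that UTF-16 represents a supplementary character such as U+1F600 as a surrogate pair, while a strict UCS-2 conversion has no representation for it and can only reject it. The helper `ucs2Units` and the error `errNotBMP` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf16"
)

// errNotBMP is a hypothetical error for input that UCS-2 cannot represent.
var errNotBMP = errors.New("code point outside the Basic Multilingual Plane")

// ucs2Units returns one 16-bit unit per code point, or an error when s
// contains a supplementary character (a code point above U+FFFF).
func ucs2Units(s string) ([]uint16, error) {
	units := make([]uint16, 0, len(s))
	for _, r := range s {
		if r > 0xFFFF {
			return nil, errNotBMP
		}
		units = append(units, uint16(r))
	}
	return units, nil
}

func main() {
	// UTF-16 encodes U+1F600 as a pair of surrogate code units.
	fmt.Printf("%04x\n", utf16.Encode([]rune("\U0001F600"))) // prints [d83d de00]

	// A strict UCS-2 conversion has to reject the same input.
	_, err := ucs2Units("\U0001F600")
	fmt.Println(err)

	// For text inside the BMP, both encodings yield the same code units.
	units, _ := ucs2Units("joué")
	fmt.Printf("%04x\n", units) // prints [006a 006f 0075 00e9]
}
```

For strings such as "Bien joué", which stay inside the Basic Multilingual Plane, the two conversions produce identical output, which is why the Python and Go results in the question match.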

    This answer was selected by the asker as the best answer.