dongshui2254 2014-02-11 08:39

Output unquoted Unicode in Go

I'm using goyaml as a YAML beautifier. By loading and dumping a YAML file, I can source-format it. I unmarshal the data from a YAML source file into a struct, marshal those bytes, and write the bytes to an output file. But the process morphs my Unicode strings into the literal version of the quoted strings, and I don't know how to reverse it.

Example input subtitle.yaml:

line: 你好

I've stripped everything down to the smallest reproducible problem. Here's the code, using _ to discard errors, none of which occur in this example:

package main

import (
    "io/ioutil"
    //"unicode/utf8"
    //"fmt"

    goyaml "gopkg.in/yaml.v1" // aliased: the package itself is named yaml
)

type Subtitle struct {
    Line string
}

func main() {
    filename := "subtitle.yaml"
    in, _ := ioutil.ReadFile(filename)
    var subtitle Subtitle
    _ = goyaml.Unmarshal(in, &subtitle)
    out, _ := goyaml.Marshal(&subtitle)

    //for len(out) > 0 { // For debugging, see what the runes are
    //  r, size := utf8.DecodeRune(out)
    //  fmt.Printf("%c ", r)
    //  out = out[size:]
    //}

    _ = ioutil.WriteFile(filename, out, 0644)
}

Actual output subtitle.yaml:

line: "\u4F60\u597D"

I want to reverse the weirdness in goyaml after I get the variable out.

The commented-out rune-printing code block, which adds spaces between runes for clarity, outputs the following. It shows that the Unicode runes aren't being decoded but are written out literally as escape sequences:

l i n e :   " \ u 4 F 6 0 \ u 5 9 7 D "
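For comparison, here is a minimal sketch of the same debugging check written with range over a string, which decodes one UTF-8 rune per iteration. The helper name spacedRunes is made up for illustration and is not part of goyaml or the code above.

```go
package main

import (
	"fmt"
	"strings"
)

// spacedRunes is a hypothetical helper: it returns s with a space
// written after every rune, mirroring the debug loop's "%c " format.
func spacedRunes(s string) string {
	var b strings.Builder
	for _, r := range s { // range decodes one UTF-8 rune per iteration
		b.WriteRune(r)
		b.WriteByte(' ')
	}
	return b.String()
}

func main() {
	// Escaped form: every escape is six separate ASCII runes.
	fmt.Println(spacedRunes(`line: "\u4F60\u597D"`))
	// Decoded form: each Chinese character is a single rune.
	fmt.Println(spacedRunes(`line: "你好"`))
}
```

If the marshaled bytes held real UTF-8, the first call would print the same thing as the second.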

How can I unquote out, before writing it to the output file, so that the output looks like the input (albeit beautified)?

Desired output subtitle.yaml:

line: "你好"

Temporary Solution

I've filed https://github.com/go-yaml/yaml/issues/11. In the meantime, @bobince's tip on yaml_emitter_set_unicode was helpful in uncovering the problem. It was defined as a C binding but never called, and there was no option to set it. I changed encode.go and added yaml_emitter_set_unicode(&e.emitter, true) at line 20, and everything works as expected. It would be better to make this optional, but that would require a change to the Marshal API.

1 answer

  • douyu0792 2014-02-11 18:38

    I had a similar issue and used this to work around the bug in goyaml.Marshal(). (*Regexp).ReplaceAllFunc is your friend here: you can use it to expand the escaped Unicode runes in the byte slice. It's probably a bit too dirty for production, but it works for the example ;-)

    package main

    import (
        "io/ioutil"
        "regexp"
        "strconv"
        "unicode/utf8"

        "launchpad.net/goyaml"
    )

    type Subtitle struct {
        Line string
    }

    // reFind matches a "key: "value"" line whose quoted value contains a \u escape.
    // (?m) makes ^ and $ match per line, so this also works on multi-line output.
    var reFind = regexp.MustCompile(`(?m)^\s*[^\s\:]+\:\s*".*\\u.*"\s*$`)

    // reFindU matches a single \uXXXX escape.
    var reFindU = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)

    func expandUnicodeInYamlLine(line []byte) []byte {
        // TODO: restrict this to the quoted string value
        return reFindU.ReplaceAllFunc(line, expandUnicodeRune)
    }

    // expandUnicodeRune turns one \uXXXX escape into its UTF-8 bytes.
    func expandUnicodeRune(esc []byte) []byte {
        // esc[2:] skips the leading `\u`; the regexp guarantees four hex digits.
        ri, _ := strconv.ParseInt(string(esc[2:]), 16, 32)
        r := rune(ri)
        repr := make([]byte, utf8.RuneLen(r))
        utf8.EncodeRune(repr, r)
        return repr
    }

    func main() {
        filename := "subtitle.yaml"
        filenameOut := "subtitleout.yaml"
        in, _ := ioutil.ReadFile(filename)
        var subtitle Subtitle
        _ = goyaml.Unmarshal(in, &subtitle)
        out, _ := goyaml.Marshal(&subtitle)

        out = reFind.ReplaceAllFunc(out, expandUnicodeInYamlLine)
        _ = ioutil.WriteFile(filenameOut, out, 0644)
    }
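A variation on the same post-processing idea, as a rough sketch rather than a drop-in fix: instead of parsing the hex digits by hand, each escape can be handed to strconv.Unquote from the standard library, which also understands the eight-digit \UXXXXXXXX form. The helper name expandEscapes is hypothetical, and the sketch assumes the marshaled output never contains an escaped backslash directly followed by u (that is, `\\u...`), which this regexp would wrongly expand.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// escapeRe matches \uXXXX and \UXXXXXXXX escapes.
var escapeRe = regexp.MustCompile(`\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}`)

// expandEscapes is a hypothetical helper that replaces each matched
// escape with the UTF-8 encoding of the code point it names.
func expandEscapes(yamlBytes []byte) []byte {
	return escapeRe.ReplaceAllFunc(yamlBytes, func(esc []byte) []byte {
		// Wrap the escape in double quotes so strconv.Unquote can
		// parse it as a Go interpreted string literal.
		s, err := strconv.Unquote(`"` + string(esc) + `"`)
		if err != nil {
			return esc // leave anything Go cannot parse untouched
		}
		return []byte(s)
	})
}

func main() {
	// Stand-in for the bytes returned by goyaml.Marshal in the question.
	out := []byte("line: \"\\u4F60\\u597D\"\n")
	fmt.Printf("%s", expandEscapes(out)) // prints: line: "你好"
}
```

In the program above, this would be called on out just before ioutil.WriteFile.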
    