dopii22884 2015-09-18 02:59

Why does the text encoding differ when using io.Copy in Go?

I am trying to rebuild a tee-like utility in Go on Windows, but I found that the encoding of the output is not always the same.

To simplify the problem, I wrote this program:

package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    count, err := io.Copy(os.Stdout, os.Stdin)
    fmt.Println(count, err)
}

I named it test. In the Windows command console, I got this output:

>test
中
中
5 <nil>

It works fine with no pipe or redirection.

>echo 中 | test
��
5 <nil>

The output is garbled if stdin comes from a pipe.

>echo 中 | test > test.txt

>type test.txt
中
5 <nil>

It works again when I also redirect the output to a file.

>test > test.txt
中

>type test.txt
荳ュ
5 <nil>

But it does not work when I use the normal stdin and redirect to a file. If I open this test.txt in another editor such as Notepad++, I find that it is encoded in UTF-8 and the content is 中.

If I use Cygwin with a UTF-8 console on Windows, everything works fine.

From the output, I know that the number of bytes the program copied is 5, which suggests it is using UTF-8 inside the program no matter what stdin is. But as far as I know, the Windows command-line console basically uses a non-Unicode encoding, so why is the input converted into UTF-8? And is there a way to make the program copy exactly what stdin sends, without any conversion?

By the way, if I use tee from GnuWin32 for the same test, everything works fine.

>where tee
D:\Tools\gnuWin32\bin\tee.exe

>echo 中 | tee
中

>tee tee.txt
中
中
^C
>type tee.txt
中

Does anyone know the reason for this, and what the solution is?


1 answer

  • douan3019 2015-09-18 04:48

    The piped input is not UTF-8. Five bytes were written because there is a space (0x20) after 中 in your echo command. Without the space I get 4 bytes:

    C:\Users\jan>echo 中| go run src/main.go
    00000000  d6 d0 0d 0a                                       |....|
    ��
    4 <nil>
    

    So on my system the console uses GBK, not UTF-8.

    I think the problem is that the Windows console cannot change a character that is already on screen, even when a later byte would complete it into a different character. For example, 'd6 d0' is 中 in GBK: if d6 is written first and is already shown as �, appending d0 afterwards does not merge the two bytes into one displayed character.

    For testing, I wrote a C# console program:

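    using System;
    using System.IO;

    class Program
    {
        static void Main(string[] args)
        {
            // 'A' followed by both bytes of 中 (GBK d6 d0) through one stream.
            using (Stream stdout = Console.OpenStandardOutput())
            {
                stdout.WriteByte((byte)'A');
                stdout.WriteByte(0xd6);
                stdout.WriteByte(0xd0);
            }

            // 'B' followed by only the first byte of 中 ...
            using (Stream stdout = Console.OpenStandardOutput())
            {
                stdout.WriteByte((byte)'B');
                stdout.WriteByte(0xd6);
            }

            // ... and the second byte through a separate stream.
            using (Stream stdout = Console.OpenStandardOutput())
            {
                stdout.WriteByte(0xd0);
            }
        }
    }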
    

    Result:

    A中BPress any key to continue . . .
    

    So my guess is that the Windows libc has a buffer in front of stdout, which assembles the two bytes into one character before printing it to the console.

    The interesting thing I found is that, even when the Windows console is on the GBK code page, Go can write to stdout using UTF-8 encoding. It seems that the bytes written to os.Stdout are not passed directly to the console.

    package main
    
    import (
        "fmt"
        "os"
    )
    
    func main() {
        os.Stdout.Write([]byte{0xe4,0xb8,0xad})
        fmt.Print("\xe4")
        fmt.Print("\xb8")
        fmt.Println("\xad")
    }
    

    Output:

    C:\Users\jan>go run src/main.go
    中中
    
    C:\Users\jan>
    