douzhiji2020 2016-08-23 22:24

Golang reads German umlauts incorrectly from stdin

I'm from Germany, so I use umlauts like ä, ö and ü. Golang, however, doesn't read them correctly from stdin.

When I execute this simple program:

package main

import (
    "bufio"
    "fmt"
    "os"
)

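func main() {
    // Create the reader once so buffered input is not discarded between lines.
    reader := bufio.NewReader(os.Stdin)
    for {
        b, _, err := reader.ReadLine()
        if err != nil {
            return // stop on EOF or a read error instead of looping forever
        }
        printBytes(b)
    }
}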

func printBytes(bytes []byte) {
    for _, b := range bytes {
        fmt.Printf("0x%X ", b)
    }
    fmt.Println()
}

I get the output:

C:\dev\golang>go run test.go
ä
0xE2 0x80 0x9E

E2 80 9E isn't the correct byte sequence for ä in UTF-8 (this tool tells me it's a "DOUBLE LOW-9 QUOTATION MARK" -> „), and when I just print out what I've read, it prints „. I've written a small "hack" which seems to read the characters correctly:

package main

/*
#include <stdio.h>
#include <stdlib.h>

char * getline(void) {
    char * line = malloc(100), * linep = line;
    size_t lenmax = 100, len = lenmax;
    int c;

    if(line == NULL)
        return NULL;

    for(;;) {
        c = fgetc(stdin);
        if(c == EOF)
            break;

        if(--len == 0) {
            len = lenmax;
            char * linen = realloc(linep, lenmax *= 2);

            if(linen == NULL) {
                free(linep);
                return NULL;
            }
            line = linen + (line - linep);
            linep = linen;
        }

        if((*line++ = c) == '\n')
            break;
    }
    *line = '\0';
    return linep;
}

void freeline(char* ptr) {
    free(ptr);
}
*/
import "C"

import (
    "fmt"
    "golang.org/x/text/encoding/charmap"
)

func getLineFromCp850() string {
    line := C.getline()
    goline := C.GoString(line)
    C.freeline(line)
    b := []byte(goline)
    ub, _ := charmap.CodePage850.NewDecoder().Bytes(b)
    return string(ub)
}

func main() {
    for {
        line := getLineFromCp850()
        printBytes([]byte(line))
    }

}

func printBytes(bytes []byte) {
    for _, b := range bytes {
        fmt.Printf("0x%X ", b)
    }
    fmt.Println()
}

And it prints out:

C:\dev\golang>go run test.go
ä
0xC3 0xA4 0xA

C3 A4 is the correct byte sequence for ä (0A is the line feed, which my hack doesn't strip), so reading the input as CP850 and converting it to UTF-8 does the job, as I expected. But why does Go give me gibberish when I read the line with Go's own functionality instead of cgo? What's wrong with Go that it gives me those values? Does it interpret the input bytes as some charset other than CP850? Is there a better Go-only way to handle this problem?

This problem only arises when reading from stdin. When I print out a UTF-8 ä to stdout it prints correctly in the console.

1 answer

  • dsieyx2015 2016-08-25 00:42

    It turned out to be a bug in Golang on some systems, specifically Windows systems where the system-wide charset and the console charset differ (that is, where GetACP() and GetConsoleCP() from the WinAPI return different values). In Germany, for example (and maybe in other western European countries), Windows uses code page 1252 as the system-wide charset but code page 850 for the console, cmd.exe. I'm not sure why, but that's how it is. Golang wrongly used the code page from GetACP() to decode the input to UTF-8 when it really should have used the one returned by GetConsoleCP(). We found the problem in the issue I created, and we'll hopefully see the fix merged for the next version of Golang.

    We also found a problem on Windows where Golang decoded characters to decomposed UTF-8 characters (i.e. it would read an ä as the character a followed by U+0308 COMBINING DIAERESIS), which could lead to other problems. For example, printing those decomposed characters can show them as two separate characters instead of one combined character.

    Accepted by the asker as the best answer.
