I'm from Germany, so I use umlauts like ä, ö and ü. Go, however, doesn't read them correctly from stdin.
When I execute this simple program:
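package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Create the reader once; a new reader per iteration can drop buffered input.
	reader := bufio.NewReader(os.Stdin)
	for {
		b, _, err := reader.ReadLine()
		if err != nil {
			return // stop on EOF or a read error
		}
		printBytes(b)
	}
}

// printBytes prints each byte of the slice in hexadecimal.
func printBytes(bytes []byte) {
	for _, b := range bytes {
		fmt.Printf("0x%X ", b)
	}
	fmt.Println()
}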
I get the output:
C:\dev\golang>go run test.go
ä
0xE2 0x80 0x9E
E2 80 9E isn't the correct byte sequence for the ä in UTF-8 (this tool tells me it's a "DOUBLE LOW-9 QUOTATION MARK" -> „), and when I just print out what I've read, it prints ". I've written a small "hack" which seems to read the characters correctly:
package main

/*
#include <stdio.h>
#include <stdlib.h>

// getline reads one line from stdin, including the trailing newline, into a
// heap-allocated buffer that grows as needed. The caller must free the result.
char * getline(void) {
	char * line = malloc(100), * linep = line;
	size_t lenmax = 100, len = lenmax;
	int c;
	if(line == NULL)
		return NULL;
	for(;;) {
		c = fgetc(stdin);
		if(c == EOF)
			break;
		if(--len == 0) {
			len = lenmax;
			char * linen = realloc(linep, lenmax *= 2);
			if(linen == NULL) {
				free(linep);
				return NULL;
			}
			line = linen + (line - linep);
			linep = linen;
		}
		if((*line++ = c) == '\n')
			break;
	}
	*line = '\0';
	return linep;
}

void freeline(char* ptr) {
	free(ptr);
}
*/
import "C"

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

// getLineFromCp850 reads one raw line via C and decodes it from CP850 to UTF-8.
func getLineFromCp850() string {
	line := C.getline()
	goline := C.GoString(line)
	C.freeline(line)
	b := []byte(goline)
	ub, _ := charmap.CodePage850.NewDecoder().Bytes(b)
	return string(ub)
}

func main() {
	for {
		line := getLineFromCp850()
		if line == "" {
			return // an empty result means EOF or an allocation failure
		}
		printBytes([]byte(line))
	}
}

// printBytes prints each byte of the slice in hexadecimal.
func printBytes(bytes []byte) {
	for _, b := range bytes {
		fmt.Printf("0x%X ", b)
	}
	fmt.Println()
}
And it prints out:
C:\dev\golang>go run test.go
ä
0xC3 0xA4 0xA
C3 A4 is the correct byte sequence for the ä (0A is the line feed, which my hack doesn't strip), so it seems that reading the input and converting it from CP850 to UTF-8 does the job, as I expected. But why does Go give me gibberish when I read the line using Go's own functionality instead of cgo? What's wrong with Go that it gives me those values? Does it interpret the input bytes as some charset other than CP850? Is there a better Go-only way to handle this problem?
This problem only arises when reading from stdin. When I print a UTF-8 ä to stdout, it prints correctly in the console.
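As a cross-check of the two byte sequences above, here is a minimal sketch using only the standard library. It rests on one fact from the published code page tables: the single byte 0x84 is ä in CP850 and „ in Windows-1252. Whether Go actually decodes the console input as Windows-1252 is only an assumption here, not something the output above proves. The helper name utf8Hex is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// utf8Hex (hypothetical helper) returns the UTF-8 encoding of r
// as space-separated, two-digit hex bytes.
func utf8Hex(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	parts := make([]string, n)
	for i, b := range buf[:n] {
		parts[i] = fmt.Sprintf("%02X", b)
	}
	return strings.Join(parts, " ")
}

func main() {
	// The same input byte, 0x84, read through two different code pages.
	const asCP850 = '\u00E4'       // ä: byte 0x84 in code page 850
	const asWindows1252 = '\u201E' // „: byte 0x84 in Windows-1252

	fmt.Println("0x84 read as CP850 gives UTF-8 bytes:", utf8Hex(asCP850))
	fmt.Println("0x84 read as Windows-1252 gives UTF-8 bytes:", utf8Hex(asWindows1252))
}
```

The CP850 reading yields C3 A4, which matches the cgo hack, and the Windows-1252 reading yields E2 80 9E, which matches what the pure Go program printed. That is at least consistent with the input byte being decoded with a different code page than the one the console uses.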