doutang3760
2015-04-16 21:55

Reading unicode characters with the bufio Scanner in Go

Accepted

I'm trying to read a plain text file that contains names like this: "CASTAÑEDA"

The code is basically like this:

file, err := os.Open("C:/Files/file.txt")
if err != nil {
    log.Fatal(err)
}
defer file.Close() // defer only once the open is known to have succeeded
scanner := bufio.NewScanner(file)
for scanner.Scan() {
    fmt.Println(scanner.Text())
}

Then, when "CASTAÑEDA" is read it prints "CASTA�EDA"

Is there any way to handle those characters when reading with bufio?

Thanks.


2 answers

  • dongxindu8753 6 years ago

    The issue you're encountering is that your input is likely not UTF-8, which is what bufio and most of the Go language and standard library expect. Instead, your input probably uses some extended-ASCII code page. That is why the unaccented characters pass through cleanly (UTF-8 is a superset of 7-bit ASCII, so those bytes mean the same thing in both encodings), while the 'Ñ' does not.

    In this situation, the byte representation of the accented character is not valid UTF-8, so the Unicode replacement character (U+FFFD) is being produced. You've got a few options:

    1. Convert your input files to UTF-8 before passing them to Go. There are many utilities that can do this, and editors often have this feature.
    2. Use golang.org/x/text/encoding/charmap together with NewReader from golang.org/x/text/transform to transform your input to UTF-8, then pass the resulting Reader to bufio.NewScanner.
    3. Change the line in the loop to os.Stdout.Write(scanner.Bytes()) followed by fmt.Println(). This might avoid the bytes being interpreted as UTF-8 beyond newline splitting, because writing the bytes directly to os.Stdout avoids any further (mis)interpretation of the contents.
  • dpt8910 6 years ago

    Your file is most probably not UTF-8. Go expects all strings to be UTF-8, which is why your console output looks mangled. I would advise using the packages golang.org/x/text/encoding/charmap and golang.org/x/text/transform to convert the file's data to UTF-8. Judging by your file path, you are on Windows, so the character encoding might be Windows-1252 (for example, if you edited the file with notepad.exe).

    Try something like this:

    package main
    
    import (
        "bufio"
        "fmt"
        "log"
        "os"
    
        "golang.org/x/text/encoding/charmap"
        "golang.org/x/text/transform"
    )
    
    func main() {
        file, err := os.Open("C:/temp/file.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()
    
        // Wrap the file so its bytes are decoded to UTF-8; insert your encoding here.
        dec := transform.NewReader(file, charmap.Windows1252.NewDecoder())
    
        scanner := bufio.NewScanner(dec)
        for scanner.Scan() {
            fmt.Println(scanner.Text())
        }
        // Report any read or decode error that stopped the scan early.
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }
    

    You can find more encodings in the package golang.org/x/text/encoding/charmap and substitute whichever one matches your file into the example.
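For the narrow case where the file is known to be ISO-8859-1 (Latin-1), the conversion can also be sketched with the standard library alone, because each Latin-1 byte value equals its Unicode code point. This is an assumption about the file, not something the question confirms: Windows-1252 differs from Latin-1 in the 0x80 to 0x9F range, so the charmap decoders remain the more general route. The helper name latin1ToUTF8 is hypothetical.

```go
package main

import "fmt"

// latin1ToUTF8 converts ISO-8859-1 bytes to a Go (UTF-8) string by mapping
// each byte to the code point with the same numeric value.
// (Hypothetical helper; valid only under the Latin-1 assumption.)
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// Ñ stored as the single byte 0xD1, as in a Latin-1 file.
	fmt.Println(latin1ToUTF8([]byte("CASTA\xD1EDA"))) // CASTAÑEDA
}
```

Inside the scanning loop this would be applied to scanner.Bytes() in place of calling scanner.Text().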
