doudihuang7642
2018-06-14 15:08

How to improve file encoding conversion in Go

I've been working with some huge files that I have to convert to UTF-8. Because the files are enormous, traditional tools like iconv won't work, so I decided to write my own tool in Go. However, I noticed that the encoding conversion is quite slow in Go. Here is my code:

package main

import (
    "fmt"
    "io"
    "log"
    "os"

    "golang.org/x/text/encoding/charmap"
)

func main() {
    if len(os.Args) != 3 {
        fmt.Fprintf(os.Stderr, "usage:\n\t%s [input] [output]\n", os.Args[0])
        os.Exit(1)
    }

    f, err := os.Open(os.Args[1])

    if err != nil {
        log.Fatal(err)
    }

    out, err := os.Create(os.Args[2])

    if err != nil {
        log.Fatal(err)
    }

    r := charmap.ISO8859_1.NewDecoder().Reader(f)

    buf := make([]byte, 1048576)

    io.CopyBuffer(out, r, buf)

    out.Close()
    f.Close()
}

Similar code in Python is much more performant:

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open("FRWAC-01.xml", "r", "latin_1") as sourceFile:
    with codecs.open("FRWAC-01-utf8.xml", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

I was sure my Go code would be much quicker because, in general, I/O in Go is fast, but it turns out to be much slower than the Python code. Is there a way to improve the Go program?


1 answer

  • Accepted answer
    doupao5296 2018-06-14 18:10

    The problem here is that you're not comparing the same code in both cases. Also, I/O speed in Go can't be significantly different from Python's, since both programs are making the same syscalls.

    In the Python version, the files are buffered by default. In the Go version, even though you pass a 1048576-byte buffer to io.CopyBuffer, the decoder makes whatever size Read calls it needs directly on the unbuffered file (see the sketch after this answer for a way to observe this).

    Wrapping the file I/O with bufio will produce comparable results:

    package main

    import (
        "bufio"
        "io"
        "log"
        "os"

        "golang.org/x/text/encoding/charmap"
    )

    func main() {
        inFile, err := os.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer inFile.Close()

        outFile, err := os.Create(os.Args[2])
        if err != nil {
            log.Fatal(err)
        }
        defer outFile.Close()

        // Buffer the reads from the input file and the writes to the output
        // file, so the decoder and the copy work against 1 MiB chunks
        // instead of hitting the unbuffered *os.File directly.
        in := bufio.NewReaderSize(inFile, 1<<20)
        out := bufio.NewWriterSize(outFile, 1<<20)
        defer out.Flush()

        r := charmap.ISO8859_1.NewDecoder().Reader(in)

        if _, err := io.Copy(out, r); err != nil {
            log.Fatal(err)
        }
    }
    
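To make the buffering point above concrete, here is a minimal, self-contained sketch (not part of the original answer) that wraps the input file in a reader which logs the size of every Read the decoder issues. The loggingReader type is purely illustrative; everything else uses only the standard library and golang.org/x/text. Run against any Latin-1 file, it shows that the reads reaching the file are sized by the decoder's own internal buffer, not by the 1 MiB buffer handed to io.CopyBuffer, which is exactly why buffering the file with bufio helps.

package main

import (
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "os"

    "golang.org/x/text/encoding/charmap"
)

// loggingReader wraps another reader and reports the size of every Read
// call that reaches it, i.e. the reads the decoder actually makes on the file.
type loggingReader struct {
    r io.Reader
}

func (l *loggingReader) Read(p []byte) (int, error) {
    n, err := l.r.Read(p)
    fmt.Fprintf(os.Stderr, "file Read: asked for %d bytes, got %d\n", len(p), n)
    return n, err
}

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Same decoder as in the question, but reading through the logger.
    r := charmap.ISO8859_1.NewDecoder().Reader(&loggingReader{r: f})

    // A 1 MiB copy buffer, as in the question; the logged reads stay small
    // because they come from the decoder's internal buffer, not from buf.
    buf := make([]byte, 1<<20)
    if _, err := io.CopyBuffer(ioutil.Discard, r, buf); err != nil {
        log.Fatal(err)
    }
}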
