dougao1542 2017-09-09 21:13
浏览 74

Standart xml解析器在Golang中的性能非常低

I have a 100Gb sized xml file and parse it with SAX method in go with this code

file, err := os.Open(filename)
handle(err)
defer file.Close()
buffer := bufio.NewReaderSize(file, 1024*1024*256) // 33554432
decoder := xml.NewDecoder(buffer)
for {
        t, _ := decoder.Token()
        if t == nil {
            break
        }
        switch se := t.(type) {
        case xml.StartElement:
            if se.Name.Local == "House" {
                house := House{}
                err := decoder.DecodeElement(&house, &se)
                handle(err)
            }
        }
    }

But golang working very slow, its seems by execution time and disk usage. My hdd capable to read data with speed around 100-120 mb/s, but golang uses only 10-13 mb/s. For experiment i rewrite this code in c#:

using (XmlReader reader = XmlReader.Create(filename)
            {
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (reader.Name == "House")
                            {
                                //Code
                            }
                            break;
                    }
                }
            }

And i got full hdd loaded, c# read data with 100-110mb/s speed. And execution time around 10 times lower.

How can i improve xml parse performance using golang?

  • 写回答

1条回答 默认 最新

  • drqyxkzbs21968684 2019-06-12 10:57
    关注

    To answer your question "How can i improve xml parse performance using golang?"

    Using the common xml.NewDecoder / decoder.Token, I was seeing 50 MB/s locally. By using https://github.com/tamerh/xml-stream-parser I was able to double the parse speed.

    To test I used Posts.xml (68 GB) from the https://archive.org/details/stackexchange archive torrent.

    package main
    
    import (
        "bufio"
        "fmt"
        "github.com/tamerh/xml-stream-parser"
        "os"
        "time"
    )
    
    func main() {
        // Using `Posts.xml` (68 GB) from https://archive.org/details/stackexchange (in the torrent)
        f, err := os.Open("Posts.xml")
        if err != nil {
            panic(err)
        }
        defer f.Close()
    
        br := bufio.NewReaderSize(f, 1024*1024)
        parser := xmlparser.NewXmlParser(br, "row")
    
        started := time.Now()
        var previous int64 = 0
    
        for x := range *parser.Stream() {
            elapsed := int64(time.Since(started).Seconds())
            if elapsed > previous {
                kBytesPerSecond := int64(parser.TotalReadSize) / elapsed / 1024
                fmt.Printf("%ds elapsed, read %d kB/s (last post.Id %s)", elapsed, kBytesPerSecond, x.Attrs["Id"])
                previous = elapsed
            }
        }
    }
    

    This will output something along the line of:

    ...s elapsed, read ... kB/s (last post.Id ...)
    

    Only caveat is that this does not give you the convenient unmarshal into struct.

    As discussed in https://github.com/golang/go/issues/21823, speed seems to be general problem with the XML implementation in Golang and would require a rewrite / rethink of that part of the standard library.

    评论

报告相同问题?

悬赏问题

  • ¥15 昨天挂载了一下u盘,然后拔了
  • ¥30 win from 窗口最大最小化,控件放大缩小,闪烁问题
  • ¥20 易康econgnition精度验证
  • ¥15 msix packaging tool打包问题
  • ¥28 微信小程序开发页面布局没问题,真机调试的时候页面布局就乱了
  • ¥15 python的qt5界面
  • ¥15 无线电能传输系统MATLAB仿真问题
  • ¥50 如何用脚本实现输入法的热键设置
  • ¥20 我想使用一些网络协议或者部分协议也行,主要想实现类似于traceroute的一定步长内的路由拓扑功能
  • ¥30 深度学习,前后端连接