douzhou7037 2018-12-31 10:21
Views: 241
Answer accepted

Performance issues when reading and parsing large XML files

I have a directory containing several large XML files (about 10 GB in total). Is there a way to iterate over the XML files in that directory, read each one in small chunks (say, 50 bytes at a time), and parse it with high performance?

func (mdc *Mdc) Loadxml(path string, wg sync.WaitGroup) {
    defer wg.Done()
    //var conf configuration
    file, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    scanner := bufio.NewScanner(file)
    buf := make([]byte, 1024*1024)
    scanner.Buffer(buf, 50)
    for scanner.Scan() {
        _, err := file.Read(buf)
        if err != nil {
            log.Fatal(err)
        }
    }

    err = xml.Unmarshal(buf, &mdc)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(mdc)
}

2 answers

  • douhuan1257 2018-12-31 12:08

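    You can do something better than reading fixed-size chunks yourself: let the XML decoder tokenize each file as a stream, so that only one element is decoded into memory at a time.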

    Say you have an XML document like this:

    <inventory>
      <item name="ACME Unobtainium">
        <tag>Foo</tag>
        <count>1</count>
      </item>
      <item name="Dirt">
        <tag>Bar</tag>
        <count>0</count>
      </item>
    </inventory>
    

    You can map it to the following data model:

    type Inventory struct {
        Items []Item `xml:"item"`
    }
    
    type Item struct {
        Name  string   `xml:"name,attr"`
        Tags  []string `xml:"tag"`
        Count int      `xml:"count"`
    }
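
To check how the struct tags line up with the sample document, here is a minimal sketch that decodes the whole document in one call. It only suits small inputs, since `xml.Unmarshal` needs the entire document in memory; the names `sample` and `parseInventory` are hypothetical and exist only for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// Inventory and Item mirror the data model from the answer.
type Inventory struct {
	Items []Item `xml:"item"`
}

type Item struct {
	Name  string   `xml:"name,attr"` // the name="..." attribute of <item>
	Tags  []string `xml:"tag"`       // every <tag> child, in document order
	Count int      `xml:"count"`     // the <count> child, parsed as an int
}

// sample is the dummy document from the answer (hypothetical name).
const sample = `<inventory>
  <item name="ACME Unobtainium">
    <tag>Foo</tag>
    <count>1</count>
  </item>
  <item name="Dirt">
    <tag>Bar</tag>
    <count>0</count>
  </item>
</inventory>`

// parseInventory decodes a complete document held in memory.
// Hypothetical helper: fine for small inputs, not for multi-GB files.
func parseInventory(data []byte) (Inventory, error) {
	var inv Inventory
	err := xml.Unmarshal(data, &inv)
	return inv, err
}

func main() {
	inv, err := parseInventory([]byte(sample))
	if err != nil {
		log.Fatal(err)
	}
	for _, it := range inv.Items {
		fmt.Printf("%s: count=%d tags=%v\n", it.Name, it.Count, it.Tags)
	}
}
```

Run against the sample, this prints `ACME Unobtainium: count=1 tags=[Foo]` followed by `Dirt: count=0 tags=[Bar]`. The same `Item` type is what the streaming loop fills in, one element at a time.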
    

    Now use filepath.Walk to visit the directory and, for each file you want to process, do something like this:

        decoder := xml.NewDecoder(file)
    
        for {
            // Read tokens from the XML document in a stream.
            t, err := decoder.Token()
    
            // If we are at the end of the file, we are done
            if err == io.EOF {
                log.Println("The end")
                break
            } else if err != nil {
                log.Fatalf("Error decoding token: %s", err)
            } else if t == nil {
                break
            }
    
            // Here, we inspect the token
            switch se := t.(type) {
    
            // We have the start of an element. se holds only the start tag
            // (name and attributes); DecodeElement below reads the rest.
            case xml.StartElement:
                switch se.Name.Local {
    
                // Found an item, so we process it
                case "item":
                    var item Item
    
                    // We decode the element into our data model...
                    if err = decoder.DecodeElement(&item, &se); err != nil {
                        log.Fatalf("Error decoding item: %s", err)
                    }
    
                    // And use it for whatever we want to
                    log.Printf("'%s' in stock: %d", item.Name, item.Count)
    
                    if len(item.Tags) > 0 {
                        log.Println("Tags")
                        for _, tag := range item.Tags {
                            log.Printf("\t%s", tag)
                        }
                    }
                }
            }
        }
    

    Working example with dummy XML: https://play.golang.org/p/MiLej7ih9Jt

    This answer was selected by the asker as the best answer.