dos49618 2014-11-05 11:41
浏览 107
已采纳

使用Go解析巨大的XML文件

We need to parse a huge XML file using Go. We'd like to use a SAX-like event based algorithm using xml.NewDecoder() and decoder.Token() library calls. We've created the appropriate struct types with XML annotations. Everything easy peasy so far.

Now, we go through the file and detect the xml.StartElement tokens. And here comes the problem. We need to decode ONLY the attributes of this starting token and continue into its content. If we call token.DecodeElement() the whole content is "decoded" or skipped in our scenario.

How to decode only the attributes of a specific StartElement and continue to the element's body?

  • 写回答

1条回答 默认 最新

  • duanrebo3559 2014-11-06 21:24
    关注

    I parse wikipedia xml dumps (~50GB xml files) in go-wikiparse using plain struct/reflect decoding. It's super simple.

    The strategy is basically this:

    First, read the envelope token:

    d := xml.NewDecoder(r)
    _, err := d.Token()
    if err != nil {
        return nil, err
    }
    

    e.g., for <someDocument><billions-of-other-things/></someDocument> that will give you someDocument.

    Then, you can just struct decode the next things in a loop:

    var i item
    d.Decode(&i)
    

    Not much RAM, and it's super easy to parse.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 聚类分析或者python进行数据分析
  • ¥15 如何用visual studio code实现html页面
  • ¥15 逻辑谓词和消解原理的运用
  • ¥15 三菱伺服电机按启动按钮有使能但不动作
  • ¥15 js,页面2返回页面1时定位进入的设备
  • ¥50 导入文件到网吧的电脑并且在重启之后不会被恢复
  • ¥15 (希望可以解决问题)ma和mb文件无法正常打开,打开后是空白,但是有正常内存占用,但可以在打开Maya应用程序后打开场景ma和mb格式。
  • ¥20 ML307A在使用AT命令连接EMQX平台的MQTT时被拒绝
  • ¥20 腾讯企业邮箱邮件可以恢复么
  • ¥15 有人知道怎么将自己的迁移策略布到edgecloudsim上使用吗?