dos49618 2014-11-05 11:41

Parsing a huge XML file with Go

We need to parse a huge XML file using Go. We'd like to use a SAX-like, event-based approach built on the xml.NewDecoder() and decoder.Token() library calls. We've created the appropriate struct types with XML annotations. Everything easy peasy so far.

Now, we go through the file and detect the xml.StartElement tokens. And here comes the problem: we need to decode ONLY the attributes of this start token and then continue into its content. If we call decoder.DecodeElement(), the whole element, content included, is decoded (or, in our scenario, skipped).

How can we decode only the attributes of a specific StartElement and then continue into the element's body?


1 answer

  • duanrebo3559 2014-11-06 21:24

    I parse Wikipedia XML dumps (~50GB XML files) in go-wikiparse using plain struct/reflect decoding. It's super simple.

    The strategy is basically this:

    First, read the envelope token:

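    d := xml.NewDecoder(r)
    // Consume the envelope's opening tag so that later Decode calls
    // see its children one at a time instead of the whole document.
    _, err := d.Token()
    if err != nil {
        return nil, err
    }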

    E.g., for <someDocument><billions-of-other-things/></someDocument> that will give you the someDocument start tag (assuming nothing else, such as an XML declaration, comes before it in the file).
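Since the question is about the attributes of a start element: the xml.StartElement token returned by d.Token() already carries them in its Attr field, so they can be read without calling DecodeElement, and the decoder stays positioned at the element's body. Below is a minimal sketch of that idea. The helper names (firstStart, attrMap, envelope, runSample) and the sampleDoc input are hypothetical, made up for illustration, and are not taken from go-wikiparse.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// firstStart advances the decoder to the first start element, skipping
// whatever precedes it (XML declaration, comments, whitespace).
func firstStart(d *xml.Decoder) (xml.StartElement, error) {
	for {
		tok, err := d.Token()
		if err != nil {
			return xml.StartElement{}, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			return se, nil
		}
	}
}

// attrMap copies the attributes of a start element into a map keyed by
// local attribute name (namespaces are ignored in this sketch).
func attrMap(se xml.StartElement) map[string]string {
	m := make(map[string]string, len(se.Attr))
	for _, a := range se.Attr {
		m[a.Name.Local] = a.Value
	}
	return m
}

// envelope returns the name and attributes of the root element. Only the
// opening tag is consumed; the element's content is left unread.
func envelope(r io.Reader) (string, map[string]string, error) {
	se, err := firstStart(xml.NewDecoder(r))
	if err != nil {
		return "", nil, err
	}
	return se.Name.Local, attrMap(se), nil
}

// sampleDoc is a hypothetical stand-in for the huge input file.
const sampleDoc = `<?xml version="1.0"?>
<someDocument version="0.10" lang="en"><item/></someDocument>`

// runSample applies envelope to sampleDoc.
func runSample() (string, map[string]string, error) {
	return envelope(strings.NewReader(sampleDoc))
}

func main() {
	name, attrs, err := runSample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name, attrs["version"], attrs["lang"]) // prints: someDocument 0.10 en
}
```

In real code the decoder would be kept around after firstStart returns, so that the loop described next can continue from the element's body.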

    Then, you can just struct decode the next things in a loop:

    var i item
    if err := d.Decode(&i); err != nil {
        return nil, err // io.EOF here means the input is exhausted
    }

    It doesn't need much RAM, since only one item is decoded at a time, and it's super easy to parse.
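Put together, the strategy above might look like the following sketch. The item type, its field tags, the eachItem and sampleTitles helpers, and the sampleDoc input are all hypothetical and only illustrate the shape of the loop. One assumption worth flagging: because item has no XMLName field, Decode will accept any child element regardless of its name, so this only works as written when the envelope contains a single kind of child.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// item is a hypothetical record type; real code would mirror the
// repeated element of the actual file.
type item struct {
	ID    int    `xml:"id,attr"`
	Title string `xml:"title"`
}

// eachItem consumes the envelope's opening tag, then decodes the child
// elements one at a time and passes each to fn. Only one item is held
// in memory at any moment.
func eachItem(r io.Reader, fn func(item) error) error {
	d := xml.NewDecoder(r)

	// Skip to the envelope's start element (past any XML declaration).
	for {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		if _, ok := tok.(xml.StartElement); ok {
			break
		}
	}

	// Decode finds the next start element and unmarshals that element.
	for {
		var it item
		err := d.Decode(&it)
		if errors.Is(err, io.EOF) {
			return nil // clean end of input
		}
		if err != nil {
			return err
		}
		if err := fn(it); err != nil {
			return err
		}
	}
}

// sampleDoc is a hypothetical stand-in for the huge input file.
const sampleDoc = `<?xml version="1.0"?>
<someDocument>
  <item id="1"><title>A</title></item>
  <item id="2"><title>B</title></item>
</someDocument>`

// sampleTitles runs eachItem over sampleDoc and collects "id:title" strings.
func sampleTitles() ([]string, error) {
	var out []string
	err := eachItem(strings.NewReader(sampleDoc), func(it item) error {
		out = append(out, fmt.Sprintf("%d:%s", it.ID, it.Title))
		return nil
	})
	return out, err
}

func main() {
	titles, err := sampleTitles()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(titles) // prints: [1:A 2:B]
}
```

Passing a callback instead of returning a slice keeps memory use flat, which matters for inputs in the tens of gigabytes.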

    This answer was accepted by the asker as the best answer.
