dongtong848825 2016-01-23 00:28

Go: decode only one XML node at a time

Looking through the source code of the encoding/xml package, all of the unmarshaling logic (which decodes the actual XML nodes and types them) is in unmarshal, and the only way to invoke it is essentially by calling DecodeElement. However, the unmarshaling logic also inherently seeks out the next EndElement. The predominant reason for this seems to be validation, but it looks like a major design flaw to me: what if I have a massive XML file, I am sufficiently confident in its structure, and I'd just like to decode a single node at a time so that I can efficiently filter through the data on the fly? RawToken() can be used to get the current tag, which is great, but when you then call DecodeElement() on it, there's an error: the inevitable unmarshal() call runs into nodes that it perceives as unbalanced.
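
The failure described above can be reproduced with a small sketch. The sample document, the Chapter type, and the firstChapter helper are made up for illustration. The sketch assumes that RawToken() does not record open elements, so the Token() calls made inside DecodeElement() later see an end tag they cannot match:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// doc is a made-up sample document used only for illustration.
const doc = `<book><chapter num="1" title="Foo">text</chapter></book>`

// Chapter is a hypothetical target type for the <chapter> element.
type Chapter struct {
	Num   int    `xml:"num,attr"`
	Title string `xml:"title,attr"`
}

// firstChapter decodes the first <chapter> element it finds.
// useRaw selects whether tokens are read with RawToken or with Token.
func firstChapter(useRaw bool) (Chapter, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		var tok xml.Token
		var err error
		if useRaw {
			tok, err = d.RawToken()
		} else {
			tok, err = d.Token()
		}
		if err != nil {
			return Chapter{}, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "chapter" {
			var c Chapter
			err := d.DecodeElement(&c, &se)
			return c, err
		}
	}
}

func main() {
	// Mixing RawToken with DecodeElement: expected to report an error.
	_, err := firstChapter(true)
	fmt.Println("RawToken:", err)

	// Using Token throughout: the element decodes cleanly.
	c, err := firstChapter(false)
	fmt.Println("Token:", c.Num, c.Title, err)
}
```

Reading the start tags with Token() instead keeps the decoder's bookkeeping consistent, which is why the second call succeeds.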

It seems theoretically possible to encounter a token that I'd like to decode, capture the offset, decode the element, seek back to the previous position, and loop, but that'd still result in a massive amount of unnecessary processing.

Is there no way to just parse one node at a time?

1 answer

  • duanaozhong0696 2016-01-23 12:42

    What you describe is called XML stream parsing, as done by any SAX parser, for example. Good news: encoding/xml supports it, although this is a bit hidden.

    What you actually have to do is create an xml.Decoder by passing an io.Reader to xml.NewDecoder. Then use Decoder.Token() to read the input stream up to the next valid XML token. From there, you can decide what to do next.

    Here is a little example, also available as a gist and runnable on the Go Playground:

    package main
    
    import (
        "bytes"
        "encoding/xml"
        "fmt"
    )
    
    const (
        book = `<?xml version="1.0" encoding="UTF-8"?>
    <book>
      <preface>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</preface>
      <chapter num="1" title="Foo">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</chapter>
      <chapter num="2" title="Bar">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</chapter>
    </book>`
    )
    
    type Chapter struct {
        Num     int    `xml:"num,attr"`
        Title   string `xml:"title,attr"`
        Content string `xml:",chardata"`
    }
    
    func main() {
    
        // We emulate a file or network stream
        b := bytes.NewBufferString(book)
    
        // And set up a decoder
        d := xml.NewDecoder(b)
    
        for {
    
            // We look for the next token
            // Note that this only reads until the next positively identified
            // XML token in the stream
            t, err := d.Token()
    
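            // Token returns io.EOF at the end of the input; for brevity,
            // any other error also just ends the loop here
            if err != nil {
                break
            }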
    
            switch et := t.(type) {
    
            case xml.StartElement:
                // We now have to inspect whether we are interested in the element;
                // otherwise we just advance to the next token
                if et.Name.Local == "chapter" {
                    // Most often/likely element first
    
                    c := &Chapter{}
    
                    // We decode the element into c (automagically advancing the
                    // stream up to the element's matching end tag).
                    // If the element is malformed or unbalanced, there will be an error.
                    // c is already a pointer, so it is passed as-is.
                    if err := d.DecodeElement(c, &et); err != nil {
                        panic(err)
                    }
    
                    // We have found what we are interested in, so we print it
                    fmt.Printf("%d: %s\n", c.Num, c.Title)
    
                } else if et.Name.Local == "book" {
                    fmt.Println("Book begins!")
                }
    
            case xml.EndElement:
    
                if et.Name.Local != "book" {
                    continue
                }
    
                fmt.Println("Finished processing book!")
            }
        }
    }
    