dongsi5381 2017-11-16 21:30
Accepted

Getting the current position in the stream from the net/html tokenizer

I'm trying to figure out whether there is a way to get the current character position of a tag using the golang.org/x/net/html tokenizer.

Simplified code looks like:

func LookForForm(body string) {
    reader := strings.NewReader(body)
    tokenizer := html.NewTokenizer(reader)
    idx := 0
    lastIdx := 0
    for {
        token := tokenizer.Next()
        lastIdx = idx
        idx = int(reader.Size()) - int(reader.Len())
        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            t := tokenizer.Token()
            tagName := strings.ToLower(t.Data)
            if tagName == "form" {
                fmt.Printf("found form at %d\n", lastIdx)
                return
            }
        }
    }
}

This doesn't work, I think because the tokenizer does not read from reader character by character but in chunks, so my Size - Len calculation is invalid. The tokenizer maintains two private span structs (https://github.com/golang/net/blob/master/html/token.go, line 147), but I don't know how to access them.
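The chunking effect can be reproduced with the standard library alone. The sketch below is only an illustration: consumedAfterOneRead is a hypothetical helper, and the 4096-byte buffer is an assumed chunk size, not a value taken from the tokenizer. It shows that Size - Len reports how much the consumer has pulled from the reader, not how far the consumer has parsed.

```go
package main

import (
	"fmt"
	"strings"
)

// consumedAfterOneRead (hypothetical helper) performs a single Read with a
// buffer of bufSize bytes and returns Size - Len, i.e. the number of bytes
// the strings.Reader has handed out so far.
func consumedAfterOneRead(body string, bufSize int) int {
	reader := strings.NewReader(body)
	buf := make([]byte, bufSize)
	_, _ = reader.Read(buf) // error ignored for brevity in this sketch
	return int(reader.Size()) - reader.Len()
}

func main() {
	body := "<p>hi</p><form></form>"
	// With a large buffer (assumed chunk size) one Read drains the whole body.
	fmt.Println(consumedAfterOneRead(body, 4096) == len(body))
	// With a one-byte buffer the reader advances one byte per Read.
	fmt.Println(consumedAfterOneRead(body, 1))
}
```

With a short document, a single chunked read already moves the reader to the end of the input, so every token appears to end at the same offset.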

One possible solution that just occurred to me is to make a "reader" that only reads a single character at a time, so that my Size and Len calculations are always correct. That seems like a hack, though, so any suggestions would be appreciated.

2 answers

  • dounuo9921 2017-11-17 20:34

    A non-buffering reader ended up working ok for me. The implementation of the reader looks something like:

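    package rule
    
    import "io"
    
    // Reader is a non-buffering string reader: each Read call returns at most
    // one byte, so Pos always equals the number of bytes the consumer has pulled.
    type Reader struct {
        s string // the full input
        i int64  // current reading index
        z int64  // size of s in bytes
    }
    
    // NewReader returns a Reader positioned at the start of s.
    func NewReader(s string) *Reader {
        return &Reader{s: s, z: int64(len(s))}
    }
    
    // String returns the full underlying input.
    func (r *Reader) String() string {
        return r.s
    }
    
    // Len returns the number of unread bytes.
    func (r *Reader) Len() int {
        if r.i >= r.z {
            return 0
        }
        return int(r.z - r.i)
    }
    
    // Size returns the total length of the input in bytes.
    func (r *Reader) Size() int64 {
        return r.z
    }
    
    // Pos returns the index of the next byte to be read.
    func (r *Reader) Pos() int64 {
        return r.i
    }
    
    // Read copies exactly one byte into b, regardless of how large b is.
    func (r *Reader) Read(b []byte) (int, error) {
        if r.i >= r.z {
            return 0, io.EOF
        }
        if len(b) == 0 {
            return 0, nil
        }
        b[0] = r.s[r.i]
        r.i++
        return 1, nil
    }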
    

    Then the position is fairly easy to calculate in the tokenizer loop:

        reader := NewReader(body)
        tokenizer := html.NewTokenizer(reader)
        // lastIdx is the reader position after the previous token,
        // which is taken as the start of the current token.
        lastIdx := 0
    tokenLoop:
        for {
            token := tokenizer.Next()
            switch token {
            case html.ErrorToken:
                break tokenLoop
            case html.StartTagToken:
                t := tokenizer.Token()
                tagName := strings.ToLower(t.Data)
                if tagName == "form" {
                    fmt.Printf("found form at %d\n", lastIdx)
                    return
                }
            }
            // Record the position after every token, including start tags
            // other than form, so the next token's start is not stale.
            lastIdx = int(reader.Pos())
        }
    
    Accepted as the best answer by the asker.