Extracting the positional offset of an *html.Node in Golang

How can I extract the positional offset of a specific node in an already parsed HTML document? For example, for the document <div>Hello, <b>World!</b></div> I want to be able to find out that the offset of World! is 15:21. The document may be changed while parsing.
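To make the 15:21 convention concrete, here is a minimal sketch that treats the offset as a half-open byte range into the source string, so that doc[15:21] is exactly World!. The helper textOffset is hypothetical and only illustrates the convention; it searches the raw string and knows nothing about the parsed tree.

```go
package main

import (
	"fmt"
	"strings"
)

// textOffset returns the half-open byte range [start, end) of the first
// occurrence of text in doc, or (-1, -1) if text does not occur.
// Hypothetical helper, used here only to illustrate the offset convention.
func textOffset(doc, text string) (int, int) {
	start := strings.Index(doc, text)
	if start < 0 {
		return -1, -1
	}
	return start, start + len(text)
}

func main() {
	doc := "<div>Hello, <b>World!</b></div>"
	start, end := textOffset(doc, "World!")
	fmt.Printf("%d:%d %q\n", start, end, doc[start:end]) // prints 15:21 "World!"
}
```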

I have a solution that renders the whole document with special markers and searches for them, but it performs badly because every lookup re-renders the document. Any ideas?

package main

import (
    "bytes"
    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
    "log"
    "strings"
)

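// nodeIndexOffset returns the start and end byte offsets of node's text in the
// rendered output of context's first child, or -1, -1 if there is no node to mark.
// It renders the tree twice, each time with a temporary marker in the text.
func nodeIndexOffset(context *html.Node, node *html.Node) (int, int) {
    // For a non-text node, mark its first child instead.
    if node.Type != html.TextNode {
        node = node.FirstChild
    }
    if node == nil {
        return -1, -1
    }
    originalData := node.Data

    // First render: find where the text starts.
    var buf bytes.Buffer
    node.Data = "|start|" + originalData
    _ = html.Render(&buf, context.FirstChild)
    start := strings.Index(buf.String(), "|start|")

    // Second render: find where the text ends.
    buf.Reset()
    node.Data = originalData + "|end|"
    _ = html.Render(&buf, context.FirstChild)
    end := strings.Index(buf.String(), "|end|")

    // Restore the node so the document is left unchanged.
    node.Data = originalData
    return start, end
}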

func main() {
    s := "<div>Hello, <b>World!</b></div>"
    context := html.Node{
        Type:     html.ElementNode,
        Data:     "body",
        DataAtom: atom.Body,
    }
    nodes, err := html.ParseFragment(strings.NewReader(s), &context)
    if err != nil {
        log.Fatal(err)
    }
    for _, node := range nodes {
        context.AppendChild(node)
    }
    world := nodes[0].FirstChild.NextSibling.FirstChild
    log.Println("target", world)
    log.Println(nodeIndexOffset(&context, world))
}

dpz90118 2016-01-23 06:50

I came up with a solution where we extend the original HTML package (please correct me if there's another way to do it) with an additional custom.go file containing a new exported function. This function can access the unexported data property of the Tokenizer, which holds exactly the start and end position of the current Node. The positions are relative to the tokenizer's buffer, so we have to adjust them after each buffer read. See globalBufDif.

I don't really like having to fork the package just to access a couple of properties, but it seems like this is the Go way.
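The adjustment behind globalBufDif can be pictured with a simplified model. The sketch below is an assumption-laden illustration, not the Tokenizer's real buffer management: span and absolutizer are hypothetical names, and it assumes a buffer reset discards exactly the bytes before the previous span's end. It uses the same comparison as the code in this answer (a span that ends before the previous one signals a reset), so it shares the same weakness: a reset goes unnoticed if the first span after it ends past the previous end.

```go
package main

import "fmt"

// span is a half-open [start, end) range relative to the current buffer.
// Hypothetical type for this sketch.
type span struct{ start, end int }

// absolutizer converts buffer-relative spans into offsets relative to the
// start of the whole input. base plays the role of globalBufDif and
// prevEnd the role of prevEndBuf. Hypothetical helper, not part of x/net/html.
type absolutizer struct {
	base    int // bytes assumed to be discarded by earlier buffer resets
	prevEnd int // buffer-relative end of the previously seen span
}

// next returns the absolute offsets for s. If s ends before the previous
// span did, the buffer is assumed to have been reset, and the previous end
// is folded into base first.
func (a *absolutizer) next(s span) [2]int {
	if a.prevEnd > s.end {
		a.base += a.prevEnd
	}
	a.prevEnd = s.end
	return [2]int{a.base + s.start, a.base + s.end}
}

func main() {
	var a absolutizer
	// Two spans, then a simulated buffer reset, then two more spans.
	for _, s := range []span{{0, 5}, {5, 12}, {0, 3}, {3, 9}} {
		fmt.Println(a.next(s))
	}
	// Output:
	// [0 5]
	// [5 12]
	// [12 15]
	// [15 21]
}
```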

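// custom.go, added to a fork of golang.org/x/net/html.
package html

import (
    "errors"
    "fmt"
    "io"

    a "golang.org/x/net/html/atom"
)

// parseWithIndexes runs the parser until EOF and returns a map from nodes to
// the tokenizer's data span (start, end) that was current when each node was
// recorded, adjusted by globalBufDif.
func parseWithIndexes(p *parser) (map[*Node][2]int, error) {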
    // Iterate until EOF. Any other error will cause an early return.
    var err error
    var globalBufDif int
    var prevEndBuf int
    var tokenIndex [2]int
    tokenMap := make(map[*Node][2]int)
    for err != io.EOF {
        // CDATA sections are allowed only in foreign content.
        n := p.oe.top()
        p.tokenizer.AllowCDATA(n != nil && n.Namespace != "")

        t := p.top().FirstChild
        for {
            if t != nil && t.NextSibling != nil {
                t = t.NextSibling
            } else {
                break
            }
        }
        tokenMap[t] = tokenIndex
        if prevEndBuf > p.tokenizer.data.end {
            globalBufDif += prevEndBuf
        }
        prevEndBuf = p.tokenizer.data.end
        // Read and parse the next token.
        p.tokenizer.Next()
        tokenIndex = [2]int{p.tokenizer.data.start + globalBufDif, p.tokenizer.data.end + globalBufDif}

        p.tok = p.tokenizer.Token()
        if p.tok.Type == ErrorToken {
            err = p.tokenizer.Err()
            if err != nil && err != io.EOF {
                return tokenMap, err
            }
        }
        p.parseCurrentToken()
    }
    return tokenMap, nil
}

// ParseFragmentWithIndexes parses a fragment of HTML and returns the nodes
// that were found. If the fragment is the InnerHTML for an existing element,
// pass that element in context.
func ParseFragmentWithIndexes(r io.Reader, context *Node) ([]*Node, map[*Node][2]int, error) {
    contextTag := ""
    if context != nil {
        if context.Type != ElementNode {
            return nil, nil, errors.New("html: ParseFragment of non-element Node")
        }
        // The next check isn't just context.DataAtom.String() == context.Data because
        // it is valid to pass an element whose tag isn't a known atom. For example,
        // DataAtom == 0 and Data = "tagfromthefuture" is perfectly consistent.
        if context.DataAtom != a.Lookup([]byte(context.Data)) {
            return nil, nil, fmt.Errorf("html: inconsistent Node: DataAtom=%q, Data=%q", context.DataAtom, context.Data)
        }
        contextTag = context.DataAtom.String()
    }
    p := &parser{
        tokenizer: NewTokenizerFragment(r, contextTag),
        doc: &Node{
            Type: DocumentNode,
        },
        scripting: true,
        fragment:  true,
        context:   context,
    }

    root := &Node{
        Type:     ElementNode,
        DataAtom: a.Html,
        Data:     a.Html.String(),
    }
    p.doc.AppendChild(root)
    p.oe = nodeStack{root}
    p.resetInsertionMode()

    for n := context; n != nil; n = n.Parent {
        if n.Type == ElementNode && n.DataAtom == a.Form {
            p.form = n
            break
        }
    }

    tokenMap, err := parseWithIndexes(p)
    if err != nil {
        return nil, nil, err
    }

    parent := p.doc
    if context != nil {
        parent = root
    }

    var result []*Node
    for c := parent.FirstChild; c != nil; {
        next := c.NextSibling
        parent.RemoveChild(c)
        result = append(result, c)
        c = next
    }
    return result, tokenMap, nil
}

This answer was selected by the asker as the best answer.
