drgdn82648 2014-01-29 03:50

Using html.ParseFragment in a generic way

Using the experimental code.google.com/p/go.net/html package (since moved to golang.org/x/net/html), we can use ParseFragment to parse a sub-section of an HTML document.

Like this:

var s = `
    <option id="foo">first</option>
    <option Class="tester">second</option>
    <option>third</option>
`
doc, err := html.ParseFragment(strings.NewReader(s), &html.Node{
    Type: html.ElementNode,
    Data: "body",
    DataAtom: atom.Body,
})

This works fine for most elements. But it doesn't seem to work when certain elements are at the root position of the HTML, like tbody, tr, and td (and perhaps others, not sure). It simply ignores the tags and only gives the text content.

This can be remedied by providing the semantically correct parent instead of atom.Body, but that requires that we know in advance what the HTML will be.

I'd hoped there was a generic root like atom.DocumentFragment, but I don't see that. So is there some way to use this in such a manner that it'll work with any arbitrary HTML fragment?

1 answer

  • dongyuandou2521 2014-01-30 16:34

    ParseFragment is always context-sensitive because it follows the HTML5 fragment-parsing algorithm. That algorithm is designed for implementing the DOM innerHTML property, and the correct tree to generate from a given innerHTML string depends on the surrounding context (especially whether the context is in a table or not).

    So the html package has no way to parse an HTML fragment independently of its context.

    If you need more information about how the parsing depends on the context, see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#reset-the-insertion-mode-appropriately

    Selected by the asker as the best answer.