drecy22400 2013-08-16 13:29 Acceptance rate: 0%
Views: 57
Accepted

How do I get the content of an html.Node?

I would like to extract data from a URL using the third-party Go library at http://godoc.org/code.google.com/p/go.net/html . However, I ran into a problem: I couldn't get the text content of an html.Node.

The reference documentation includes the following example:

s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
doc, err := html.Parse(strings.NewReader(s))
if err != nil {
    log.Fatal(err)
}
var f func(*html.Node)
f = func(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "a" {
        for _, a := range n.Attr {
            if a.Key == "href" {
                fmt.Println(a.Val)
                break
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        f(c)
    }
}
f(doc)

The output is:

foo
/bar/baz

If I want to get

Foo
BarBaz

What should I do?


1 answer

  • douwei7501 2013-08-16 14:10

    The tree of <a href="link"><strong>Foo</strong>Bar</a> looks basically like this:

    • ElementNode "a" (this node also includes a list of attributes)
      • ElementNode "strong"
        • TextNode "Foo"
      • TextNode "Bar"

    So, assuming that you want the plain text of the link (e.g. FooBar), you have to walk through the tree and collect all text nodes. For example:

    func collectText(n *html.Node, buf *bytes.Buffer) {
        if n.Type == html.TextNode {
            buf.WriteString(n.Data)
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            collectText(c, buf)
        }
    }
    

    And the corresponding change to your function:

    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            text := &bytes.Buffer{}
            collectText(n, text)
            fmt.Println(text)
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
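
The traversal above can also be tried without the external package. The following is a minimal sketch under stated assumptions: `Node`, `NodeType`, `linkTexts` and `sampleTree` are hypothetical names introduced here, and `Node` is only a stand-in that mirrors the fields of html.Node used in this answer (Type, Data, FirstChild, NextSibling). It also swaps bytes.Buffer for strings.Builder, which is a substitution and not part of the original answer.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are hypothetical stand-ins for html.NodeType and
// html.Node. They copy only the fields this sketch needs, so the program
// runs with the standard library alone.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type        NodeType
	Data        string
	FirstChild  *Node
	NextSibling *Node
}

// collectText appends the data of every text node under n, in document order.
func collectText(n *Node, sb *strings.Builder) {
	if n.Type == TextNode {
		sb.WriteString(n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectText(c, sb)
	}
}

// linkTexts returns the plain text of every <a> element at or below n.
func linkTexts(n *Node) []string {
	var out []string
	if n.Type == ElementNode && n.Data == "a" {
		var sb strings.Builder
		collectText(n, &sb)
		out = append(out, sb.String())
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		out = append(out, linkTexts(c)...)
	}
	return out
}

// sampleTree builds <a href="link"><strong>Foo</strong>Bar</a> by hand
// (attributes are omitted because the traversal does not read them).
func sampleTree() *Node {
	bar := &Node{Type: TextNode, Data: "Bar"}
	foo := &Node{Type: TextNode, Data: "Foo"}
	strong := &Node{Type: ElementNode, Data: "strong", FirstChild: foo, NextSibling: bar}
	return &Node{Type: ElementNode, Data: "a", FirstChild: strong}
}

func main() {
	for _, text := range linkTexts(sampleTree()) {
		fmt.Println(text) // prints "FooBar" for the sample tree
	}
}
```

With the real library, the same two functions should work on the node returned by html.Parse once `*Node`, `TextNode` and `ElementNode` are replaced by `*html.Node`, `html.TextNode` and `html.ElementNode`.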
    
    This answer was selected as the best answer by the asker.
