douzhi19900102 2014-10-01 02:50

HTML: find all child tags within a given tag

Assume I have an HTML page that contains something like:

<ul class ="good">
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>

<ul class ="bad">
    <li>a</li>
    <li>b</li>
    <li>c</li>
</ul>

I want to grab the <li> elements inside the first <ul>. I have basically copied the code below from an existing example (note: code edited per @twotwotwo's comment):

page, _ := html.Parse(httpBody)
var f func(*html.Node)
f = func(n *html.Node) {
    //fmt.Println("Inside f")
    if n.Type == html.ElementNode && n.Data == "ul" {
        fmt.Println("ul found ->  ", n)
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    } else {
        fmt.Println(n.Data, "is not the correct one")
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
}
f(page)

But the only output I obtain is

 is not the correct one
html is not the correct one
head is not the correct one
body is not the correct one

I wonder why the recursion stops at body. I have tried it with motherfuckingwebsite.com, which has tags inside the body.

P.S. I have also tried

page := html.NewTokenizer(httpBody)

for {
    tokenType := page.Next()
    if tokenType == html.ErrorToken {
        return links
    }
    token := page.Token()

but this seems to show all the tokens, without regard for the tree structure.



1 answer

  • doukuang1897 2014-10-01 04:33

    I have, in the past, used this package: https://github.com/PuerkitoBio/goquery

    It provides a jQuery-like interface for querying HTML documents. With that library, it's as simple as this:

    package main
    
    import (
        "bytes"
        "fmt"
        "log"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    var httpBody string = `
        <ul class ="good">
            <li>1</li>
            <li>2</li>
            <li>3</li>
        </ul>
    
        <ul class ="bad">
            <li>a</li>
            <li>b</li>
            <li>c</li>
        </ul>
    `
    
    func main() {
        b := bytes.NewBufferString(httpBody)
        doc, err := goquery.NewDocumentFromReader(b)
        if err != nil {
            log.Fatal(err)
        }
    
        doc.Find("ul.good").Each(func(i int, ul *goquery.Selection) {
            ul.Find("li").Each(func(i int, li *goquery.Selection) {
                fmt.Println(li.Text())
            })
        })
    }
    

    Which prints:

    1
    2
    3
    
    This answer was selected by the asker as the best answer.
