doudao7511 2015-01-07 20:39
Views: 171
Accepted

goquery: extract text from one HTML tag and add it to the next tag

Yeah, sorry that the title explains nothing. I'll need to use an example.

This is a continuation of another question I posted which solved one problem but not all of them. I've put most of the background info from that question into this one. Also, I've only been looking into Go for about 5 days (and I only started learning code a couple months ago), so I'm 90% sure that I'm close to figuring out what I want and that the problem is that I've got some silly syntax mistakes.

Situation

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database.) Here's what the page looks like:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

Objective

I'd like to:

  1. Extract the content of <h1..."text".
  2. Insert (and concatenate) this extracted content into the content of <p..."text".
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

Once again, an example explains ^this better. This is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle</span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

Solution Attempts

Because distinguishing the <h1> tags further from the <p> tags would give me more parsing options, I've figured out how to change the class attributes of the <h1> tags to this:

<html>
    <body>
        <h1>
            <span class="title">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="title">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

with this code:

html_code := strings.NewReader(`
code_example_above
`)
doc, _ := goquery.NewDocumentFromReader(html_code)
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
    }
})

I know that I can select the <p..."text" following the <h1..."title" with either doc.Find("h1+p") or s.Next() inside the doc.Find("h1").Each function:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
        fmt.Println(s.Next().Text())
    }
})

I can't figure out how to insert the text from <h1..."title" to <p..."text". I've tried using quite a few variations of s.After(), s.Before(), and s.Append(), e.g., like this:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        s.After(s.Text())
        fmt.Println(s.Next().Text())
    }
})

but I can't figure out how to do exactly what I want.

If I use s.After(s.Next().Text()) instead, I get this error output:

panic: expected identifier, found 5 instead

goroutine 1 [running]:
code.google.com/p/cascadia.MustCompile(0xc2082f09a0, 0x62, 0x62)
    /home/*/go/src/code.google.com/p/cascadia/selector.go:59 +0x77
github.com/PuerkitoBio/goquery.(*Selection).After(0xc2082ea630, 0xc2082f09a0, 0x62, 0x5)
    /home/*/go/src/github.com/PuerkitoBio/goquery/manipulation.go:18 +0x32
main.func·001(0x0, 0xc2082ea630)
    /home/*/go/test2.go:78 +0x106
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc2082ea600, 0x7cb678, 0x2)
    /home/*/go/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0x173
main.ExampleScrape()
    /home/*/go/test2.go:82 +0x213
main.main()
    /home/*/go/test2.go:175 +0x1b

goroutine 9 [runnable]:
net/http.(*persistConn).readLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:928 +0x9ce
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:660 +0xc9f

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 10 [select]:
net/http.(*persistConn).writeLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:661 +0xcbc
exit status 2

(The lines of my script don't match the lines of the examples above, but "line 72" of my script contains the code s.After(s.Next().Text()). I don't know what exactly panic: expected identifier, found 5 instead means.)

Summary

In summary, my problem is that I can't quite wrap my head around how to use goquery to add text to a tag.

I think I'm close. Would any gopher Jedis be able and willing to help this padawan?

1 answer

  • dongsechuan0535 2015-04-10 20:50

    Something like the code below does the job. It finds every <h1> node and, inside each one, looks for a <span> with class text. It then takes the element that follows the <h1>. If that element is a <p> whose first child is a <span> with class text, the code replaces that <span> with a new <span> holding the combined text and removes the <h1>.

    I wonder if it's possible to create nodes using goquery without writing html...

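    package main

    import (
        "fmt"
        "html"
        "log"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    var htmlCode = `<html>
    ...
    </html>`

    func main() {
        doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlCode))
        if err != nil {
            log.Fatal(err)
        }
        doc.Find("h1").Each(func(i int, h1 *goquery.Selection) {
            h1.Find("span").Each(func(j int, s *goquery.Selection) {
                if !s.HasClass("text") {
                    return
                }
                // Next() never returns nil, so check that the next sibling really is a <p>.
                p := h1.Next()
                if !p.Is("p") {
                    return
                }
                // Only merge when the first child of the <p> exists and has class "text".
                ps := p.Children().First()
                if ps.Length() == 0 || !ps.HasClass("text") {
                    return
                }
                // Text() returns unescaped text, so escape it before building markup.
                merged := html.EscapeString(s.Text() + ps.Text())
                ps.ReplaceWithHtml(
                    fmt.Sprintf("<span class=\"text\">%s</span>", merged))
                h1.Remove()
            })
        })
        htmlResult, err := doc.Html()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(htmlResult)
    }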
    