doudao7511 2015-01-07 20:39

goquery: Extract text from one HTML tag and add it to the next tag

Yeah, sorry that the title explains nothing. I'll need to use an example.

This is a continuation of another question I posted which solved one problem but not all of them. I've put most of the background info from that question into this one. Also, I've only been looking into Go for about 5 days (and I only started learning code a couple months ago), so I'm 90% sure that I'm close to figuring out what I want and that the problem is that I've got some silly syntax mistakes.

Situation

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). Here's what it looks like:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

Objective

I'd like to:

  1. Extract the text of the <span class="text"> inside each <h1>.
  2. Prepend this extracted text to the text of the <span class="text"> inside a <p>.
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

Once again, an example explains ^this better. This is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle</span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

Solution Attempts

Because further distinguishing the <h1> tags from the <p> tags would give me more parsing options, I've figured out how to change the class attributes inside the <h1> tags to this:

<html>
    <body>
        <h1>
            <span class="title">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="title">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

with this code:

html_code := strings.NewReader(`
code_example_above
`)
doc, _ := goquery.NewDocumentFromReader(html_code)
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
    }
})

I know that I can select the <p..."text" following the <h1..."title" with either doc.Find("h1+p") or s.Next() inside the doc.Find("h1").Each function:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
        fmt.Println(s.Next().Text())
    }
})

I can't figure out how to insert the text from <h1..."title" to <p..."text". I've tried using quite a few variations of s.After(), s.Before(), and s.Append(), e.g., like this:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        s.After(s.Text())
        fmt.Println(s.Next().Text())
    }
})

but I can't figure out how to do exactly what I want.

If I use s.After(s.Next().Text()) instead, I get this error output:

panic: expected identifier, found 5 instead

goroutine 1 [running]:
code.google.com/p/cascadia.MustCompile(0xc2082f09a0, 0x62, 0x62)
    /home/*/go/src/code.google.com/p/cascadia/selector.go:59 +0x77
github.com/PuerkitoBio/goquery.(*Selection).After(0xc2082ea630, 0xc2082f09a0, 0x62, 0x5)
    /home/*/go/src/github.com/PuerkitoBio/goquery/manipulation.go:18 +0x32
main.func·001(0x0, 0xc2082ea630)
    /home/*/go/test2.go:78 +0x106
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc2082ea600, 0x7cb678, 0x2)
    /home/*/go/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0x173
main.ExampleScrape()
    /home/*/go/test2.go:82 +0x213
main.main()
    /home/*/go/test2.go:175 +0x1b

goroutine 9 [runnable]:
net/http.(*persistConn).readLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:928 +0x9ce
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:660 +0xc9f

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 10 [select]:
net/http.(*persistConn).writeLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:661 +0xcbc
exit status 2

(The line numbers of my script don't match the examples above, but line 78 of my script, the line named in the trace, contains the code s.After(s.Next().Text()). I don't know what exactly "panic: expected identifier, found 5 instead" means.)

Summary

In summary, my problem is that I can't quite wrap my head around how to use goquery to add text to a tag.

I think I'm close. Would any gopher Jedis be able and willing to help this padawan?

1 answer

  • dongsechuan0535 2015-04-10 20:50

    Something like this code does the job. It finds all <h1> nodes, then looks through the <span> nodes inside each <h1> for one with class text. It then takes the element that follows the <h1>; if that element is a <p> whose first child is a <span> with class text, it replaces that <span> with a new <span> containing the combined text and removes the <h1>.

    I wonder if it's possible to create nodes using goquery without writing html...

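    package main

    import (
        "fmt"
        "html"
        "log"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    var htmlCode = `<html>
    ...
    </html>`

    func main() {
        doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlCode))
        if err != nil {
            log.Fatal(err)
        }
        doc.Find("h1").Each(func(i int, h1 *goquery.Selection) {
            h1.Find("span").Each(func(j int, s *goquery.Selection) {
                if !s.HasClass("text") {
                    return
                }
                // Next never returns nil, so test the match instead:
                // Is("p") is false for an empty selection or a non-<p> sibling.
                p := h1.Next()
                if !p.Is("p") {
                    return
                }
                // HasClass is false for an empty selection as well.
                ps := p.Children().First()
                if !ps.HasClass("text") {
                    return
                }
                // Escape the text so that characters such as < and & stay literal.
                ps.ReplaceWithHtml(fmt.Sprintf(`<span class="text">%s</span>`,
                    html.EscapeString(s.Text()+ps.Text())))
                h1.Remove()
            })
        })
        htmlResult, err := doc.Html()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(htmlResult)
    }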
    
    This answer was selected by the asker as the best answer.
