dpwdldgn43486 2015-01-05 23:04
Accepted

goquery: concatenating a tag with the tag that follows it

For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). For my problem, an example will be the easiest way to explain it:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

I'd like to:

  1. Extract the content of <h1..."text".
  2. Insert (and concatenate) this extracted content into the content of <p..."text".
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

So this is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

With the code starting off like this,

package main

import (
    "fmt"
    "strings"
    "github.com/PuerkitoBio/goquery"
)

func main() {
    html_code := strings.NewReader(`code_example_above`)
    doc, _ := goquery.NewDocumentFromReader(html_code)

I know that I can read <h1..."text" with:

h1_tag := doc.Find("h1 .text")

I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:

doc.Find("p .text").Before("h1 .text")

^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".

Then, I found out how to get a step closer to what I want:

doc.Find("p .text").First().Before("h1 .text")

^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).

I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)

My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.

If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.

My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.

EDIT

I found out something else I can do:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    nex := s.Next().Text()
    fmt.Println(s.Text(), nex, "\n\n")
})

^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc--the *goquery.Selection that it's iterating through. Is that correct?

Or, as mattn pointed out, I could also use doc.Find("h1+p").

I'm still having trouble appending <h1..."text" to <p..."text". I'll post that as a separate question, because this one can be broken down into multiple questions and mattn already answered one of them.


1 answer

  • douan3019 2015-01-06 05:27

    I'm not sure exactly what code you are writing with goquery, but what you are looking for may be the adjacent sibling selector.

    h1+p
    

    This selects every p tag that immediately follows an h1 tag.

    Accepted by the asker as the best answer.
