dpwdldgn43486 2015-01-05 23:04
Views: 40
Answer accepted

goquery: concatenate a tag with the tag that follows it

For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). For my problem, an example will be the easiest way to explain it:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

I'd like to:

  1. Extract the content of <h1..."text".
  2. Insert (and concatenate) this extracted content into the content of <p..."text".
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

So this is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle</span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

With the code starting off like this,

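package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // `code_example_above` stands in for the HTML document shown above.
    html_code := strings.NewReader(`code_example_above`)
    doc, err := goquery.NewDocumentFromReader(html_code)
    if err != nil {
        log.Fatal(err)
    }
    // The snippets below operate on doc and continue from here.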

I know that I can read <h1..."text" with:

h1_tag := doc.Find("h1 .text")

I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:

doc.Find("p .text").Before("h1 .text")

But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".

Then, I found out how to get a step closer to what I want:

doc.Find("p .text").First().Before("h1 .text")

This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).

I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)

My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.

If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.

My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.

EDIT

I found out something else I can do:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    nex := s.Next().Text()
    fmt.Println(s.Text(), nex, "\n\n")
})

^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc--the *goquery.Selection that it's iterating through. Is that correct?

Or, as mattn pointed out, I could also use doc.Find("h1+p").

I'm still having trouble appending <h1..."text" to <p..."text". I'll post it as another question because you can break this one down into multiple questions, and Mattn already answered one.

1 answer

  • douan3019 2015-01-06 05:27

    I'm not sure exactly what you are trying to write with goquery, but what you want may be the adjacent sibling selector:

    h1+p
    

    This selects every p tag that immediately follows an h1 tag.

    Selected by the asker as the best answer.
