dongwen7423
2013-12-06 22:05

Go: regexp FindAll and ReplaceAll in a single pass

  • regex
Accepted

I'm parsing a web page to get some values inside labels, but I'm not interested in the label, only in the content.

I'm using regexp.FindAll to get all the matching expressions (including the label) and then ReplaceAll on each match to strip the label and keep only the subexpression. Running the regexp twice takes twice as long, of course, and I'd like to avoid that.

Is there a way to apply both functions simultaneously, or an equivalent regexp?

Of course, I could write a function to remove the label, but in some cases that could be more complex because of variable-length labels (like ), and a regexp can take care of this.

A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    res, err := http.Get("http://www.elpais.es")
    if err != nil {
        panic(err)
    }

    body, err := ioutil.ReadAll(res.Body)
    res.Body.Close()
    // Check the read error before using body.
    if err != nil {
        panic(err)
    }
    fmt.Println("body: ", len(body), cap(body))

    r := regexp.MustCompile("<li>(.+)</li>")

    // Find all subexpressions, containing the label <li>
    out := r.FindAll(body, -1)

    for i, v := range out[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }

    // Replace to remove the label.
    out2 := make([][]byte, len(out))
    for i, v := range out {
        out2[i] = r.ReplaceAll(v, []byte("$1"))
    }

    for i, v := range out2[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }
}

By the way, I understand that regex cannot be used to parse HTML. I'm only interested in some of the innermost labels, not in the structure or nestings, so I suppose it is OK :)


1 answer

  • dongpu8935, 8 years ago

    Recommendation: use goquery for this task. It is very simple to use and reduces your code considerably. Example:

    doc, _ := goquery.NewDocument("http://www.elpais.es")
    text := doc.Find("li").Slice(10, -1).Text()
    

    Regarding your question, use FindAllSubmatch to extract the match directly:

    r := regexp.MustCompile("<li>(.+)</li>")
    
    // Find all subexpressions, containing the label <li>
    out := r.FindAllSubmatch(body, -1)
    
    for i, v := range out[:10] {
        fmt.Printf("%d: %s\n", i, v[1])
    }
    