dongwen7423 2013-12-06 22:05

Go: regexp FindAll and ReplaceAll in one pass

I'm parsing a web page to get some values inside labels (HTML tags), but I'm not interested in the label itself, only in its content.

I'm using regexp.FindAll to get all the matching expressions (including the label) and then ReplaceAll on each match to keep only the subexpression, removing the label. Running the regexp twice takes twice as long, of course, and I'd like to avoid that.

Is there a way to apply both functions in a single pass, or an equivalent regexp?

Of course, I could write a function to remove the label, but in some cases that could be more complex because of variable-length labels (like ), and a regexp can take care of this.

A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY

package main

import (
    "fmt"
    "io"
    "net/http"
    "regexp"
)

func main() {
    res, err := http.Get("http://www.elpais.es")
    if err != nil {
        panic(err)
    }
    defer res.Body.Close()

    // Check the read error before using body.
    body, err := io.ReadAll(res.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println("body: ", len(body), cap(body))

    r := regexp.MustCompile("<li>(.+)</li>")

    // First pass: find all matches, each still wrapped in the label <li>.
    out := r.FindAll(body, -1)

    // out[:10] assumes the page yields at least 10 matches.
    for i, v := range out[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }

    // Second pass: run the regexp again on each match to remove the label.
    out2 := make([][]byte, len(out))
    for i, v := range out {
        out2[i] = r.ReplaceAll(v, []byte("$1"))
    }

    for i, v := range out2[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }
}

By the way, I understand that regex cannot be used to parse HTML. I'm only interested in some of the innermost labels, not in the structure or nestings, so I suppose it is OK :)

1 answer

  • dongpu8935 2013-12-06 22:44

    Recommendation: use goquery for this task. It is very simple to use and greatly reduces the amount of code. Example:

    doc, _ := goquery.NewDocument("http://www.elpais.es")
    text := doc.Find("li").Slice(10, -1).Text()
    

    Regarding your question, use FindAllSubmatch, which returns the capture groups along with each match, so the content can be extracted directly without a second pass:

    r := regexp.MustCompile("<li>(.+)</li>")

    // Find all matches; v[0] is the full match including the label <li>,
    // v[1] is the first capture group (the content only).
    out := r.FindAllSubmatch(body, -1)

    for i, v := range out[:10] {
        fmt.Printf("%d: %s\n", i, v[1])
    }
    
    Accepted by the asker as the best answer.
