dongwen7423 2013-12-06 22:05

Go: regexp FindAll and ReplaceAll in one pass

I'm parsing a web page to get some values inside labels, but I'm not interested in the label, only in the content.

I'm using regexp.FindAll to get all the matching expressions (including the label) and then ReplaceAll on each match to remove the label. Running the regexp twice takes twice as long, of course, and I'd like to avoid it.

Is there a way to apply both functions simultaneously, or an equivalent regexp?

Of course, I could write a function to remove the label, but in some cases that could be more complex because of the variable-length labels (like ), and a regexp can take care of this.

A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    res, err := http.Get("http://www.elpais.es")
    if err != nil {
        panic(err)
    }

    body, err := ioutil.ReadAll(res.Body)
    res.Body.Close()
    if err != nil {
        panic(err)
    }
    fmt.Println("body: ", len(body), cap(body))

    r := regexp.MustCompile("<li>(.+)</li>")

    // Find all matches; each one still includes the <li> label.
    out := r.FindAll(body, -1)

    for i, v := range out[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }

    // Run the regexp again on each match to strip the label.
    out2 := make([][]byte, len(out))
    for i, v := range out {
        out2[i] = r.ReplaceAll(v, []byte("$1"))
    }

    for i, v := range out2[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }
}

By the way, I understand that regex cannot be used to parse HTML. I'm only interested in some of the innermost labels, not in the structure or nestings, so I suppose it is OK :)

1 answer

  • dongpu8935 2013-12-06 22:44

    Recommendation: use goquery for this task. It is very simple to use and cuts your code down considerably. Example:

    doc, _ := goquery.NewDocument("http://www.elpais.es")
    text := doc.Find("li").Slice(10, -1).Text()
    

    Regarding your question, use FindAllSubmatch to extract the captured group directly, in a single pass:

    r := regexp.MustCompile("<li>(.+)</li>")
    
    // Find all matches together with their capture groups.
    out := r.FindAllSubmatch(body, -1)
    
    for i, v := range out[:10] {
        // v[0] is the whole match (with the label); v[1] is the text captured by (.+).
        fmt.Printf("%d: %s\n", i, v[1])
    }
    
    This answer was selected by the asker as the best answer.
