I'm parsing a web page to get some values inside labels, but I'm not interested in the label, only in the content.
I'm using regexp.FindAll to get all the matching expressions (including the label) and then ReplaceAll on every match to replace it with its subexpression, removing the label. Running the regexp twice takes twice as long, of course, and I'd like to avoid that.
Is there a way to apply both functions simultaneously, or an equivalent regexp?
Of course, I could write a function to remove the label, but in some cases that could be more complex because of variable-length labels (like ), and a regexp can take care of this.
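To illustrate what I mean by such a function, here is a minimal sketch for the fixed-label case. The name stripLabel and the sample input are made up for this example; it only works when the opening label is known exactly, which is the limitation mentioned above.

```go
package main

import (
	"bytes"
	"fmt"
)

// stripLabel is a hypothetical helper: it removes a fixed opening and
// closing label from a match and returns only the content between them.
func stripLabel(match, openTag, closeTag []byte) []byte {
	return bytes.TrimSuffix(bytes.TrimPrefix(match, openTag), closeTag)
}

func main() {
	// Made-up sample match, as FindAll would return it.
	m := []byte("<li>Portada</li>")
	fmt.Printf("%s\n", stripLabel(m, []byte("<li>"), []byte("</li>")))
}
```

If the opening label carries attributes, the fixed prefix no longer matches and the label is left in place, so the function would need its own parsing logic.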
A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY
    package main

    import (
        "fmt"
        "io/ioutil"
        "net/http"
        "regexp"
    )

    func main() {
        res, err := http.Get("http://www.elpais.es")
        if err != nil {
            panic(err)
        }
        body, err := ioutil.ReadAll(res.Body)
        res.Body.Close()
        if err != nil {
            panic(err)
        }
        fmt.Println("body: ", len(body), cap(body))

        r := regexp.MustCompile("<li>(.+)</li>")
        // Find all matches; each one still contains the label <li>.
        out := r.FindAll(body, -1)
        // Print the first 10 (assumes the page has at least 10 matches).
        for i, v := range out[:10] {
            fmt.Printf("%d: %s\n", i, v)
        }
        // Run the regexp again on every match to remove the label.
        out2 := make([][]byte, len(out))
        for i, v := range out {
            out2[i] = r.ReplaceAll(v, []byte("$1"))
        }
        for i, v := range out2[:10] {
            fmt.Printf("%d: %s\n", i, v)
        }
    }
By the way, I understand that regex cannot be used to parse HTML. I'm only interested in some of the innermost labels, not in the structure or nestings, so I suppose it is OK :)