Go: regexp FindAll and ReplaceAll in one pass

I'm parsing a web page to get some values inside labels, but I'm not interested in the label, only in the content.

I'm using regexp.FindAll to get all the matching expressions (including the label) and then ReplaceAll on each match to strip the label and keep only the subexpression. Running the regexp twice takes twice as long, of course, and I'd like to avoid that.

Is there a way to apply both functions in a single pass, or an equivalent regexp?

Of course, I could write a function to remove the label, but in some cases that would be more complex because of variable-length labels (like ), and a regexp can take care of this.

A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    res, err := http.Get("http://www.elpais.es")
    if err != nil {
        panic(err)
    }

    // Read the whole page and check the error before using body.
    body, err := ioutil.ReadAll(res.Body)
    res.Body.Close()
    if err != nil {
        panic(err)
    }
    fmt.Println("body: ", len(body), cap(body))

    r := regexp.MustCompile("<li>(.+)</li>")

    // Find all matches; each one still contains the label <li>.
    out := r.FindAll(body, -1)

    // Note: out[:10] panics if the page yields fewer than 10 matches.
    for i, v := range out[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }

    // Run the regexp a second time on each match to remove the label.
    out2 := make([][]byte, len(out))
    for i, v := range out {
        out2[i] = r.ReplaceAll(v, []byte("$1"))
    }

    for i, v := range out2[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }
}

By the way, I understand that regex cannot be used to parse HTML. I'm only interested in some of the innermost labels, not in the structure or nestings, so I suppose it is OK :)

1 answer

dongpu8935 2013-12-06 22:44
Recommendation: use goquery for this task. It is very simple to use and shortens your code considerably. Example:

doc, _ := goquery.NewDocument("http://www.elpais.es")
text := doc.Find("li").Slice(10, -1).Text()

Regarding your question, use FindAllSubmatch to extract the match directly:

r := regexp.MustCompile("<li>(.+)</li>")

// Find all subexpressions, containing the label <li>
out := r.FindAllSubmatch(body, -1)

for i, v := range out[:10] {
    fmt.Printf("%d: %s\n", i, v[1])
}
This answer was accepted by the asker as the best answer.
