Go: FindAll and ReplaceAll with regexp in a single pass

I'm parsing a web page to get some values inside labels, but I'm not interested in the label, only in the content.

I'm using regexp.FindAll to get all the matching expressions (including the label) and then ReplaceAll on each match to strip the label and keep only the subexpression. Running the regexp twice takes twice as long, of course, and I'd like to avoid that.

Is there a way to apply both functions in a single pass, or an equivalent regexp?

Of course, I could write a function to remove the label, but in some cases that could be more complex because of variable-length labels (like ), whereas a regexp can take care of this.

A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    res, err := http.Get("http://www.elpais.es")
    if err != nil {
        panic(err)
    }

    body, err := ioutil.ReadAll(res.Body)
    res.Body.Close()
    // Check the read error before using body.
    if err != nil {
        panic(err)
    }
    fmt.Println("body: ", len(body), cap(body))

    r := regexp.MustCompile("<li>(.+)</li>")

    // Find all subexpressions, containing the label <li>
    out := r.FindAll(body, -1)

    for i, v := range out[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }

    //Replace to remove the label.
    out2 := make([][]byte, len(out))
    for i, v := range out {
        out2[i] = r.ReplaceAll(v, []byte("$1"))
    }

    for i, v := range out2[:10] {
        fmt.Printf("%d: %s\n", i, v)
    }
}

By the way, I understand that regex cannot be used to parse HTML. I'm only interested in some of the innermost labels, not in the structure or nestings, so I suppose it is OK :)


1 answer

dongpu8935 2013-12-06 22:44
Recommendation: use goquery for this task. It is very simple to use and shortens your code considerably. Example:

doc, _ := goquery.NewDocument("http://www.elpais.es")
text := doc.Find("li").Slice(10, -1).Text()

Regarding your question, use FindAllSubmatch to extract the match directly:

r := regexp.MustCompile("<li>(.+)</li>")

// Find all matches; v[0] is the whole match including the <li> label,
// v[1] is the content captured by the first group.
out := r.FindAllSubmatch(body, -1)

for i, v := range out[:10] {
    fmt.Printf("%d: %s\n", i, v[1])
}
This answer was selected by the asker as the best answer.

Question asked 2013-12-06 22:05