Extracting text content from HTML in Golang

What's the best way to extract inner substrings from strings in Golang?

input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

output:

"this is paragraph 

 this is paragraph 2"

Is there any string package/library for Go that already does something like this?

package main

import (
    "fmt"
    "strings"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
    // output: this is paragraph
    //
    //         this is paragraph 2
}

func getInnerStrings(start, end, str string) string {
    // Brain freeze: regex? Loop over the bytes?
    return "" // placeholder so the stub compiles
}

thanks


3 answers

duanla3319 2014-01-08 16:00
Don't use regular expressions to try and interpret HTML. Use a fully capable HTML tokenizer and parser.

I recommend you read this article on CodingHorror.
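
For the narrow case in the question, where the delimiters are fixed literal strings, a minimal sketch using only the standard library's strings package might look like the following. The helper name innerStrings, its slice return type, and the whitespace trimming are assumptions made for illustration and are not from the original post. It does plain substring matching and knows nothing about HTML (attributes, nesting, comments), so for real HTML input a tokenizer such as the one in golang.org/x/net/html remains the safer choice, as this answer says.

```go
package main

import (
	"fmt"
	"strings"
)

// innerStrings is a hypothetical helper (not from the original post).
// It returns every substring of str that lies between a literal start
// delimiter and the next literal end delimiter, with surrounding
// whitespace trimmed. It is plain substring matching, not HTML parsing.
func innerStrings(start, end, str string) []string {
	var parts []string
	if start == "" || end == "" {
		return parts // empty delimiters would match everywhere
	}
	for {
		i := strings.Index(str, start)
		if i < 0 {
			break // no further start delimiter
		}
		str = str[i+len(start):]
		j := strings.Index(str, end)
		if j < 0 {
			break // unterminated segment: ignore it
		}
		parts = append(parts, strings.TrimSpace(str[:j]))
		str = str[j+len(end):]
	}
	return parts
}

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	// Join the extracted segments one per line for display.
	fmt.Println(strings.Join(innerStrings("<p>", "</p>", longString), "\n"))
}
```

Returning a slice instead of one joined string leaves the choice of separator to the caller. An opening delimiter with no matching close is silently skipped in this sketch.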

This answer was accepted by the asker as the best answer.
