douhan8610 2014-01-08 07:48
Views: 653
Accepted

Extracting text content from HTML in Golang

What's the best way to extract inner substrings from strings in Golang?

input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

output:

"this is paragraph 

 this is paragraph 2"

Is there any string package/library for Go that already does something like this?

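package main

import (
    "fmt"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
    // Desired output:
    //   this is paragraph
    //   this is paragraph 2
}

// getInnerStrings should return every substring of str that lies
// between a start delimiter and the following end delimiter.
func getInnerStrings(start, end, str string) string {
    // Brain freeze: this is the part I don't know how to write.
    // Regex?
    // Byte loop?
    return "" // placeholder so the stub compiles
}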

Thanks.


3 answers

  • duanla3319 2014-01-08 08:00

    Don't use regular expressions to try and interpret HTML. Use a fully capable HTML tokenizer and parser.

    I recommend you read this article on CodingHorror.

    This answer was selected by the asker as the best answer.
