douhan8610
2014-01-08 15:48
Views: 585
Accepted

Extracting text content from HTML in Golang

What's the best way to extract inner substrings from strings in Golang?

input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

output:

"this is paragraph 

 this is paragraph 2"

Is there any string package/library for Go that already does something like this?

package main

import (
    "fmt"
    "strings"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
    // output: this is paragraph
    //
    //         this is paragraph 2
}

func getInnerStrings(start, end, str string) string {
    // Brain freeze.
    // Regex?
    // Bytes loop?
}

Thanks.


3 answers

  • duanla3319 2014-01-08 16:00
    Accepted

    Don't use regular expressions to try and interpret HTML. Use a fully capable HTML tokenizer and parser.

    I recommend you read this article on CodingHorror.

  • dpict99695329 2015-03-23 09:38

    StrExtract retrieves a string between two delimiters.

    StrExtract(sExper, sAdelim, sCdelim, nOccur)

    sExper: Specifies the expression to search.

    sAdelim: Specifies the string that delimits the beginning of the text to extract.

    sCdelim: Specifies the string that delimits the end of the text to extract.

    nOccur: Specifies at which occurrence of sAdelim in sExper to start the extraction (1 is the first occurrence).

    Go Play

    package main
    
    import (
        "fmt"
        "strings"
    )
    
    func main() {
        s := "a11ba22ba333ba4444ba55555ba666666b"
        fmt.Println("StrExtract1: ", StrExtract(s, "a", "b", 5))
    }
    
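    // StrExtract returns the text that follows the nOccur-th occurrence of
    // sAdelim (counting from 1) up to the first sCdelim, or "" if there is none.
    func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
        // Element i holds the text after the i-th sAdelim, up to the next sAdelim.
        aExper := strings.Split(sExper, sAdelim)

        // Occurrences count from 1; anything outside the valid range has no match.
        if nOccur < 1 || len(aExper) <= nOccur {
            return ""
        }

        sMember := aExper[nOccur]
        aExper = strings.Split(sMember, sCdelim)

        // A single element means sCdelim does not occur in this segment.
        if len(aExper) == 1 {
            return ""
        }

        return aExper[0]
    }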
    
  • dpa89292 2016-01-21 09:33

    Here is a function that I have been using a lot.

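    // GetInnerSubstring returns the text between the first occurrence of prefix
    // and the first occurrence of suffix after it. If prefix is not found, or a
    // non-empty prefix is found but suffix does not follow it, the result is "".
    // An empty prefix matches the start of str; an empty suffix matches its end.
    func GetInnerSubstring(str string, prefix string, suffix string) string {
        var beginIndex, endIndex int
        beginIndex = strings.Index(str, prefix)
        if beginIndex == -1 {
            // prefix not found: the result is empty
            beginIndex = 0
            endIndex = 0
        } else if len(prefix) == 0 {
            // empty prefix: take everything up to suffix, or the whole string
            beginIndex = 0
            endIndex = strings.Index(str, suffix)
            if endIndex == -1 || len(suffix) == 0 {
                endIndex = len(str)
            }
        } else {
            beginIndex += len(prefix)
            endIndex = strings.Index(str[beginIndex:], suffix)
            if endIndex == -1 {
                // suffix does not occur after prefix: the result is empty
                endIndex = beginIndex
            } else if len(suffix) == 0 {
                // empty suffix: take everything after prefix
                endIndex = len(str)
            } else {
                // convert the offset within str[beginIndex:] to an index into str
                endIndex += beginIndex
            }
        }

        return str[beginIndex:endIndex]
    }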
    

    You can try it at the playground, https://play.golang.org/p/Xo0SJu0Vq4.
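
GetInnerSubstring above returns only the first match, while the question asks for every `<p>`...`</p>` section. A minimal sketch that repeats the same index-based search until the input is exhausted, using the question's `getInnerStrings` signature. Trimming whitespace and joining the pieces with newlines are assumptions about the desired output, and treating empty delimiters as "no match" is a choice made here for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// getInnerStrings collects every substring of str that sits between start and
// end, trims surrounding whitespace, and joins the pieces with newlines.
func getInnerStrings(start, end, str string) string {
	if start == "" || end == "" {
		return "" // assumption: empty delimiters are treated as no match
	}
	var parts []string
	for {
		i := strings.Index(str, start)
		if i == -1 {
			break // no further opening delimiter
		}
		str = str[i+len(start):] // skip past the opening delimiter
		j := strings.Index(str, end)
		if j == -1 {
			break // unterminated section: ignore it
		}
		parts = append(parts, strings.TrimSpace(str[:j]))
		str = str[j+len(end):] // continue after the closing delimiter
	}
	return strings.Join(parts, "\n")
}

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(getInnerStrings("<p>", "</p>", longString))
	// prints:
	// this is paragraph
	// this is paragraph 2
}
```

This is plain delimiter matching, so the accepted answer's warning still applies: tags with attributes or nested markup need a real tokenizer.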
