dongzhao1865
2019-02-09 19:52, 9 views
Answer accepted

Handling malformed HTML with Go's net/html tokenizer?

I've found that html.NewTokenizer() doesn't auto-fix malformed markup, so you can end up with a stray closing tag (html.EndTagToken). For example, <div></p></div> is tokenized as html.StartTagToken, html.EndTagToken, html.EndTagToken.

Is there a recommended way to ignore, remove, or fix these tags?

My first guess would be to manually keep a []atom.Atom slice as a stack: push on each start tag and pop on each end tag, after comparing the end tag against the top of the stack so that an unexpected end tag is caught.
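A minimal sketch of that bookkeeping, to make the idea concrete. It is not the net/html API: `tok`, `dropStrayEndTags`, and `render` are hypothetical names, `tok` stands in for html.Token so the example runs with only the standard library, and plain strings are used in place of atom.Atom.

```go
package main

import (
	"fmt"
	"strings"
)

// tok is a hypothetical stand-in for html.Token, reduced to what the
// bookkeeping needs: whether it is an end tag, and the tag name.
type tok struct {
	end  bool
	name string
}

// dropStrayEndTags keeps a stack of open tag names and discards any end
// tag that does not match the element on top of the stack.
func dropStrayEndTags(in []tok) []tok {
	var open []string // stack of currently open elements
	var out []tok
	for _, t := range in {
		if !t.end {
			open = append(open, t.name) // push on start tag
			out = append(out, t)
			continue
		}
		if n := len(open); n > 0 && open[n-1] == t.name {
			open = open[:n-1] // pop on matching end tag
			out = append(out, t)
		}
		// Any other end tag is unexpected and is skipped.
	}
	return out
}

// render turns the tokens back into markup for display.
func render(ts []tok) string {
	var sb strings.Builder
	for _, t := range ts {
		if t.end {
			sb.WriteString("</" + t.name + ">")
		} else {
			sb.WriteString("<" + t.name + ">")
		}
	}
	return sb.String()
}

func main() {
	// Token stream for <div><div><p></p></p></div>
	in := []tok{
		{false, "div"}, {false, "div"}, {false, "p"},
		{true, "p"}, {true, "p"}, {true, "div"},
	}
	fmt.Println(render(dropStrayEndTags(in)))
}
```

This only drops end tags that do not match the top of the stack. Elements that are never closed stay on the stack, and void elements such as <br>, which the tokenizer reports as start tags with no end tag, would need special handling in a real version.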

Here is some code to demonstrate the problem:

package main

import (
    "fmt"
    "io"
    "log"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    htm := `<div><div><p></p></p></div>`

    tokenizer := html.NewTokenizer(strings.NewReader(htm))

    for {
        if tokenizer.Next() == html.ErrorToken {
            // io.EOF only means the input is exhausted; anything else is a real error.
            if err := tokenizer.Err(); err != io.EOF {
                log.Fatal(err)
            }
            return
        }

        token := tokenizer.Token()

        switch token.Type {
        case html.DoctypeToken, html.CommentToken, html.TextToken:
            // Not relevant to the tag structure; skip.
            continue
        case html.SelfClosingTagToken:
            fmt.Println(token.Data)
        case html.StartTagToken:
            fmt.Printf("<%s>\n", token.Data)
        case html.EndTagToken:
            // The stray second </p> is reported here like any other end tag.
            fmt.Printf("</%s>\n", token.Data)
        }
    }
}

Output:

<div>
<div>
<p>
</p>
</p>
</div>

1 answer

  • Accepted
    dpztth71739 2019-02-09 22:14

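    FWIW, it seems that net/html can fix such issues if you use its Parse function instead of the tokenizer. Parse implements the HTML5 parsing algorithm, which defines how to recover from errors such as stray end tags. Here's an example adapted from another SO answer, using your malformed HTML snippet: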

    package main
    
    import (
        "bytes"
        "fmt"
        "log"
        "strings"
    
        "golang.org/x/net/html"
    )
    
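    func main() {
        brokenHTML := `<div><div><p></p></p></div>`

        // Parse builds a full document tree, repairing the markup as it goes.
        root, err := html.Parse(strings.NewReader(brokenHTML))
        if err != nil {
            log.Fatal(err)
        }

        // Serialize the repaired tree back to HTML.
        var b bytes.Buffer
        if err := html.Render(&b, root); err != nil {
            log.Fatal(err)
        }

        // The output is a complete document, so it includes <html>, <head> and <body>.
        fmt.Println(b.String())
    }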
    