dongzhao1865 2019-02-09 19:52

Handling malformed HTML with Go's net/html tokenizer?

I've found that html.NewTokenizer() doesn't auto-fix malformed markup: the tokenizer reports tags exactly as they appear in the input, so you can end up with a stray closing tag (html.EndTagToken). For example, <div></p></div> yields html.StartTagToken, html.EndTagToken, html.EndTagToken.

Is there a recommended way to ignore, remove, or fix these tags?

My first guess would be to manually keep a []atom.Atom slice as a stack: push when a tag starts and pop when it ends (after comparing the end tag against the stack to make sure it isn't an unexpected end tag).
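
Roughly, the push/pop idea could look like the sketch below. To keep it free of dependencies it works on a hypothetical, simplified tagToken type instead of html.Token; in real code the End and Name fields would presumably come from token.Type and token.Data (or token.DataAtom). The names tagToken, voidElements, balance and render are made up for illustration, and the policy chosen here (drop unmatched end tags, close skipped and leftover elements) is just one assumption about what "fixing" should mean, not what net/html itself does.

```go
package main

import (
	"fmt"
	"strings"
)

// tagToken is a hypothetical, simplified stand-in for html.Token:
// it carries only the fields the balancing logic needs.
type tagToken struct {
	End  bool   // true for an end tag such as </p>
	Name string // lower-case tag name, like token.Data
}

// voidElements never take an end tag, so they must not be pushed.
// This is a partial list, for illustration only.
var voidElements = map[string]bool{
	"br": true, "hr": true, "img": true, "input": true, "link": true, "meta": true,
}

// balance drops end tags that match no open element, closes elements
// skipped over by an end tag, and closes whatever is still open at the end.
func balance(tokens []tagToken) []tagToken {
	var open []string // stack of currently open tag names
	var out []tagToken

	for _, t := range tokens {
		if !t.End {
			out = append(out, t)
			if !voidElements[t.Name] {
				open = append(open, t.Name)
			}
			continue
		}
		// Find the nearest open element with this name.
		i := len(open) - 1
		for i >= 0 && open[i] != t.Name {
			i--
		}
		if i < 0 {
			continue // stray end tag: nothing to close, so drop it
		}
		// Close everything above the match, then the match itself.
		for j := len(open) - 1; j >= i; j-- {
			out = append(out, tagToken{End: true, Name: open[j]})
		}
		open = open[:i]
	}
	// Close elements left open at the end of the input.
	for j := len(open) - 1; j >= 0; j-- {
		out = append(out, tagToken{End: true, Name: open[j]})
	}
	return out
}

// render writes the tokens back out as markup.
func render(tokens []tagToken) string {
	var sb strings.Builder
	for _, t := range tokens {
		if t.End {
			sb.WriteString("</" + t.Name + ">")
		} else {
			sb.WriteString("<" + t.Name + ">")
		}
	}
	return sb.String()
}

func main() {
	// Token sequence for <div><div><p></p></p></div>
	in := []tagToken{
		{Name: "div"}, {Name: "div"}, {Name: "p"},
		{End: true, Name: "p"}, {End: true, Name: "p"}, {End: true, Name: "div"},
	}
	fmt.Println(render(balance(in))) // <div><div><p></p></div></div>
}
```

Searching the whole stack rather than only its top means an input like <div><p></div> still closes the <p> before the <div>, instead of discarding the </div>.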

Here is some code to demonstrate the problem:

package main

import (
    "fmt"
    "io"
    "log"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    htm := `<div><div><p></p></p></div>`

    tokenizer := html.NewTokenizer(strings.NewReader(htm))

    for {
        if tokenizer.Next() == html.ErrorToken {
            // io.EOF only means the input is exhausted; anything else is a real error.
            if err := tokenizer.Err(); err != io.EOF {
                log.Fatal(err)
            }
            return
        }

        token := tokenizer.Token()

        switch token.Type {
        case html.SelfClosingTagToken:
            fmt.Println(token.Data)
        case html.StartTagToken:
            fmt.Printf("<%s>\n", token.Data)
        case html.EndTagToken:
            // The stray second </p> is reported like any other end tag.
            fmt.Printf("</%s>\n", token.Data)
        default:
            // Doctype, comment and text tokens are skipped.
        }
    }
}

Output:

<div>
<div>
<p>
</p>
</p>
</div>

1 answer

  • dpztth71739 2019-02-09 22:14

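    FWIW, net/html can fix such issues when you build a tree with its html.Parse function instead of tokenizing. Parse implements the HTML5 parsing algorithm, so the tree it returns is always well-formed. Note that the rendered result is a complete document, wrapped in <html>, <head> and <body>. Here's an example adapted from another Stack Overflow answer, using your malformed HTML snippet: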

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "strings"

        "golang.org/x/net/html"
    )

    func main() {
        brokenHTML := `<div><div><p></p></p></div>`

        // html.Parse builds a complete node tree from the malformed input.
        root, err := html.Parse(strings.NewReader(brokenHTML))
        if err != nil {
            log.Fatal(err)
        }

        // Serialize the repaired tree back to markup.
        var b bytes.Buffer
        if err := html.Render(&b, root); err != nil {
            log.Fatal(err)
        }

        fmt.Println(b.String())
    }
    
    This answer was accepted by the asker as the best answer.
