I’ve found that html.NewTokenizer() doesn’t auto-fix some things, so it’s possible to end up with a stray closing tag (an html.EndTagToken with no matching start tag). For example, <div></p></div> tokenizes as html.StartTagToken, html.EndTagToken, html.EndTagToken.
Is there a recommended way to handle these tags — ignore, remove, or fix them?
My first guess would be to manually keep a []atom.Atom slice as a stack, pushing as each tag opens and popping as it closes (comparing the tag names so an unexpected end tag can be detected).
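To make the idea concrete, here is a minimal sketch of that push/pop approach. It uses plain strings instead of atom.Atom so it stays dependency-free, and the token struct and balance function are made-up names for illustration, not part of the x/net/html API:

```go
package main

import "fmt"

// token is a simplified stand-in for the tokenizer's output:
// kind is "start" or "end", name is the tag name.
type token struct {
	kind string
	name string
}

// balance drops end tags that don't match the innermost open tag,
// using a slice as a push/pop stack of open tag names.
func balance(tokens []token) []token {
	var stack []string // open tag names, innermost last
	var out []token
	for _, t := range tokens {
		switch t.kind {
		case "start":
			stack = append(stack, t.name) // push
			out = append(out, t)
		case "end":
			if len(stack) > 0 && stack[len(stack)-1] == t.name {
				stack = stack[:len(stack)-1] // matched: pop and keep
				out = append(out, t)
			}
			// Otherwise it's a stray end tag: drop it.
		}
	}
	return out
}

func main() {
	// Token stream for <div><div><p></p></p></div>;
	// the second </p> is the stray tag.
	in := []token{
		{"start", "div"}, {"start", "div"}, {"start", "p"},
		{"end", "p"}, {"end", "p"}, {"end", "div"},
	}
	for _, t := range balance(in) {
		if t.kind == "start" {
			fmt.Printf("<%s>", t.name)
		} else {
			fmt.Printf("</%s>", t.name)
		}
	}
	fmt.Println()
}
```

Note this sketch simply discards any end tag that doesn’t match the innermost open tag, which is cruder than the recovery a real HTML parser performs (e.g. <div><p></div> would lose the </div> here rather than auto-closing the <p>).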
Here is some code to demonstrate the problem:
package main

import (
	"fmt"
	"io"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	htm := `<div><div><p></p></p></div>`
	tokenizer := html.NewTokenizer(strings.NewReader(htm))
	for {
		if tokenizer.Next() == html.ErrorToken {
			// io.EOF just means the input is exhausted.
			if err := tokenizer.Err(); err != io.EOF {
				fmt.Println(err)
			}
			return
		}
		token := tokenizer.Token()
		switch token.Type {
		case html.SelfClosingTagToken:
			fmt.Println(token.Data)
		case html.StartTagToken:
			fmt.Printf("<%s>\n", token.Data)
		case html.EndTagToken:
			fmt.Printf("</%s>\n", token.Data)
		default:
			// Skip doctype, comment, and text tokens.
		}
	}
}
Output:
<div>
<div>
<p>
</p>
</p>
</div>