I need HTML that cannot be parsed by html.Parse()

I am writing a Go function to read an HTML response body and extract the page title. Overall, the function works just great, but I want to test the code path where the response body isn't proper HTML at all. My simplistic attempts to create some invalid HTML for unit tests have come to naught.

Apparently, and according to the html.Parse documentation, this is because:

the HTML5 parsing algorithm […] is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in r's data can be silently dropped, with no corresponding node in the resulting tree.

Here is some code demonstrating the sort of approach I've been taking:

https://play.golang.org/p/T5WjdtjNcqq

package main

import (
    "bytes"
    "fmt"
    "golang.org/x/net/html"
)

func main() {
    inputs := []string{ "",
        "~",
        "<",
        "<ht",
        "<html",
        "<html>",
        "<html><",
        "<html><titl",
        "<html><title",
        "<html><title>",
        "<html><title>The C Progr",
        "<html><title>The C Programming Language",
        "<html><title>The C Programming Language<",
        "<html><title>The C Programming Language</",
        "<html><title>The C Programming Language</ti",
        "<html><title>The C Programming Language</title",
        "<html><title>The C Programming Language</title>",
        "<html><title>The C Programming Language</title><",
        "<html><title>The C Programming Language</title></",
        "<html><title>The C Programming Language</title></ht",
        "<html><title>The C Programming Language</title></html",
        "<html><title>The C Programming Language</title></html>",
    }

    for _, in := range inputs {
        fmt.Printf("%s\n", in)

        r := bytes.NewReader([]byte(in))
        _, err := html.Parse(r)
        if err != nil {
            fmt.Printf("COULD NOT PARSE HTML\n")
            panic(err)
        }
    }
}

Silly me, I would have expected many of these to yield an error since at face value they are invalid HTML, but the above code sails through all of the input strings without panic'ing -- that is, with no non-nil err from html.Parse().

I suppose I am grateful for a lenient / tolerant HTML parser, but: Does anyone have an example of text that would yield an error when fed to Go's html.Parse()?

EDIT 1

Combining ideas from comments by Ferrybig and CreationTribe, I even tried a huge stream of random bytes:

    rand.Seed(time.Now().UnixNano())

    in := make([]byte, 0)
    for i := 0; i < 2147483647; i++ {
        in = append(in, byte(rand.Intn(255)))
    }
    fmt.Printf("len(in) : %d\n", len(in))

    r := bytes.NewReader(in)
    _, err := html.Parse(r)

… and it still did not error.

Is there no input that will cause html.Parse() to error out?


1 answer

duanbage2161 2019-08-11 19:22
From a quick read of https://github.com/golang/net/blob/master/html/token.go, it seems that the only errors the tokenizer can report are:

io.EOF once r has been read to the end (html.Parse treats this as success and returns a nil error);

any other errors returned by the underlying io.Reader; or

html.ErrBufferExceeded

It's not obvious to me after an initial read how to trigger ErrBufferExceeded, but you could trigger an error from html.Parse by providing a dummy reader:

    // ErrReader is an io.Reader whose Read always fails with Error.
    type ErrReader struct{ Error error }

    func (e *ErrReader) Read([]byte) (int, error) {
        return 0, e.Error
    }

https://play.golang.org/p/s78HpfMLAI8

Hope that helps
This answer was selected by the asker as the best answer.
