dongpan1871 2019-08-09 21:48
Accepted

I need HTML that html.Parse() cannot parse

I am writing a Go function to read an HTML response body and extract the page title. Overall, the function works just great, but I want to test the code path where the response body isn't proper HTML at all. My simplistic attempts to create some invalid HTML for unit tests have come to naught.

Apparently, and according to the html.Parse documentation, this is because:

the HTML5 parsing algorithm […] is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in r's data can be silently dropped, with no corresponding node in the resulting tree.

Here is some code demonstrating the sort of approach I've been taking:

https://play.golang.org/p/T5WjdtjNcqq

package main

import (
    "bytes"
    "fmt"
    "golang.org/x/net/html"
)

func main() {
    inputs := []string{
        "",
        "~",
        "<",
        "<ht",
        "<html",
        "<html>",
        "<html><",
        "<html><titl",
        "<html><title",
        "<html><title>",
        "<html><title>The C Progr",
        "<html><title>The C Programming Language",
        "<html><title>The C Programming Language<",
        "<html><title>The C Programming Language</",
        "<html><title>The C Programming Language</ti",
        "<html><title>The C Programming Language</title",
        "<html><title>The C Programming Language</title>",
        "<html><title>The C Programming Language</title><",
        "<html><title>The C Programming Language</title></",
        "<html><title>The C Programming Language</title></ht",
        "<html><title>The C Programming Language</title></html",
        "<html><title>The C Programming Language</title></html>",
    }

    for _, in := range inputs {
        fmt.Printf("%s\n", in)

        r := bytes.NewReader([]byte(in))
        _, err := html.Parse(r)
        if err != nil {
            fmt.Printf("COULD NOT PARSE HTML\n")
            panic(err)
        }
    }
}

Silly me, I would have expected many of these to yield an error, since at face value they are invalid HTML, but the above code sails through all of the input strings without panicking; that is, html.Parse() never returns a non-nil err.

I suppose I am grateful for a lenient / tolerant HTML parser, but: Does anyone have an example of text that would yield an error when fed to Go's html.Parse()?

EDIT 1

Combining ideas from comments by Ferrybig and CreationTribe, I even tried a huge stream of random bytes:

    rand.Seed(time.Now().UnixNano())

    in := make([]byte, 0)
    for i := 0; i < 2147483647; i++ {
        in = append(in, byte(rand.Intn(256))) // Intn(256) covers every byte value 0..255
    }
    fmt.Printf("len(in) : %d\n", len(in))

    r := bytes.NewReader(in)
    _, err := html.Parse(r)

… and it still did not error.

Is there no input that will cause html.Parse() to error out?

1 answer

  • duanbage2161 2019-08-11 19:22

    From a quick read of https://github.com/golang/net/blob/master/html/token.go, it seems that the only returned errors can be:

    • io.EOF once r is fully read successfully;
    • any other errors returned by the underlying io.Reader; or
    • html.ErrBufferExceeded

    It's not obvious to me after an initial read how to trigger ErrBufferExceeded, but you could trigger an error from html.Parse by providing a dummy reader:

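    // ErrReader is an io.Reader whose Read always fails with Error.
    type ErrReader struct{ Error error }

    // Read reports zero bytes read and returns the configured error.
    func (e *ErrReader) Read([]byte) (int, error) {
        return 0, e.Error
    }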
    

    https://play.golang.org/p/s78HpfMLAI8
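
A variation on the dummy-reader idea, as a rough sketch: the reader below serves a partial document first and only then fails, which is closer to a response body that breaks off mid-stream. The names failingReader, truncatedBody, consume, and errBoom are hypothetical and exist only for this illustration. To stay within the standard library, io.ReadAll stands in for the function under test; in a real test you would presumably hand the same reader to html.Parse and check the returned err.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errBoom is a hypothetical sentinel error used only in this sketch.
var errBoom = errors.New("simulated read failure")

// failingReader is a hypothetical io.Reader whose Read always fails with Err.
type failingReader struct{ Err error }

// Read reports zero bytes read and returns the configured error.
func (f *failingReader) Read([]byte) (int, error) {
	return 0, f.Err
}

// truncatedBody returns a reader that yields prefix and then fails with err,
// imitating a response body that breaks off part-way through.
func truncatedBody(prefix string, err error) io.Reader {
	return io.MultiReader(strings.NewReader(prefix), &failingReader{Err: err})
}

// consume stands in for the function under test (assumption: the real code
// would call html.Parse here). It drains r and returns the bytes read so far
// together with any non-EOF error from the reader.
func consume(r io.Reader) (string, error) {
	data, err := io.ReadAll(r)
	return string(data), err
}

func main() {
	got, err := consume(truncatedBody("<html><title>The C Progr", errBoom))
	fmt.Printf("read %q, err = %v\n", got, err)
}
```

If only the always-failing case is needed, the standard library's testing/iotest package (Go 1.16 and later) already provides iotest.ErrReader(err), so a hand-written type may be unnecessary.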

    Hope that helps

    This answer was selected by the asker as the best answer.
