I am writing a Go function to read an HTML response body and extract the page title. Overall, the function works just great, but I want to test the code path where the response body isn't proper HTML at all. My simplistic attempts to create some invalid HTML for unit tests have come to naught.
Apparently, according to the html.Parse documentation, this is because:
"the HTML5 parsing algorithm […] is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in r's data can be silently dropped, with no corresponding node in the resulting tree."
Here is some code demonstrating the sort of approach I've been taking:
https://play.golang.org/p/T5WjdtjNcqq
package main

import (
	"bytes"
	"fmt"

	"golang.org/x/net/html"
)

func main() {
	inputs := []string{
		"",
		"~",
		"<",
		"<ht",
		"<html",
		"<html>",
		"<html><",
		"<html><titl",
		"<html><title",
		"<html><title>",
		"<html><title>The C Progr",
		"<html><title>The C Programming Language",
		"<html><title>The C Programming Language<",
		"<html><title>The C Programming Language</",
		"<html><title>The C Programming Language</ti",
		"<html><title>The C Programming Language</title",
		"<html><title>The C Programming Language</title>",
		"<html><title>The C Programming Language</title><",
		"<html><title>The C Programming Language</title></",
		"<html><title>The C Programming Language</title></ht",
		"<html><title>The C Programming Language</title></html",
		"<html><title>The C Programming Language</title></html>",
	}
	for _, in := range inputs {
		fmt.Printf("%s\n", in)
		r := bytes.NewReader([]byte(in))
		_, err := html.Parse(r)
		if err != nil {
			fmt.Printf("COULD NOT PARSE HTML\n")
			panic(err)
		}
	}
}
Silly me, I would have expected many of these to yield an error, since at face value they are invalid HTML, but the above code sails through all of the input strings without panicking (that is, with no non-nil err from html.Parse()).
I suppose I am grateful for a lenient / tolerant HTML parser, but: Does anyone have an example of text that would yield an error when fed to Go's html.Parse()?
EDIT 1
Combining ideas from comments by Ferrybig and CreationTribe, I even tried a huge stream of random bytes:
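// Fragment: assumes "bytes", "fmt", "math/rand", "time" and
// "golang.org/x/net/html" are imported.
rand.Seed(time.Now().UnixNano())
in := make([]byte, 0)
for i := 0; i < 2147483647; i++ {
	// Intn(256) covers every byte value 0..255; Intn(255) never yields 0xFF.
	in = append(in, byte(rand.Intn(256)))
}
fmt.Printf("len(in) : %d\n", len(in))
r := bytes.NewReader(in)
_, err := html.Parse(r)
fmt.Printf("err : %v\n", err)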
… and it still did not error.
Is there no input that will cause html.Parse() to error out?