doulu2576 2017-09-20 01:04

Parsing HTML with Go

I'm trying to build a web scraper in Go. I'm fairly new to the language, and I'm not sure what I'm doing wrong with the HTML parser. I'm trying to parse the HTML to find anchor tags, but I keep getting html.TokenTypeEnd instead.

package main

import (
    "fmt"
    "golang.org/x/net/html"
    "io/ioutil"
    "net/http"
)

func GetHtml(url string) (text string, resp *http.Response, err error) {
    var bytes []byte
    if url == "https://www.coastal.edu/scs/employee" {
        resp, err = http.Get(url)
        if err != nil {
            fmt.Println("There seems to be an error with the Employee Console.")
        }
        bytes, err = ioutil.ReadAll(resp.Body)
        if err != nil {
            fmt.Println("Cannot read byte response from Employee Console.")
        }
        text = string(bytes)
    } else {
        fmt.Println("Issue with finding URL. Looking for: " + url)
    }

    return text, resp, err
}

func main() {
    htmlSrc, response, err := GetHtml("https://www.coastal.edu/scs/employee")
    if err != nil {
        fmt.Println("Cannot read HTML source code.")
    }
    _ = htmlSrc
    htmlTokens := html.NewTokenizer(response.Body)
    i := 0
    for i < 1 {

        tt := htmlTokens.Next()
        fmt.Printf("%T", tt)
        switch tt {

        case html.ErrorToken:
            fmt.Println("End")
            i++

        case html.TextToken:
            fmt.Println(tt)

        case html.StartTagToken:
            t := htmlTokens.Token()

            isAnchor := t.Data == "a"
            if isAnchor {
                fmt.Println("We found an anchor!")
            }

        }

    }
}

I get html.TokenTypeEnd whenever I print the token type with fmt.Printf("%T", tt).

1 answer

  • dongzhanlu0658 2017-09-20 01:25

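    The application reads to the end of the body in GetHtml, so by the time the tokenizer reads from response.Body the read returns EOF and the tokenizer returns html.ErrorToken on its first call to Next. The output html.TokenTypeEnd is not a token type: it is the type name html.TokenType printed by %T (with no trailing newline), immediately followed by the "End" printed in the ErrorToken case.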

    Use this code:

    htmlTokens := html.NewTokenizer(strings.NewReader(htmlSrc))
    

    to create the tokenizer.

    Also, close the response body in GetHtml to prevent a connection leak.

    The code can be simplified to:

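        package main

        import (
            "fmt"
            "log"
            "net/http"

            "golang.org/x/net/html"
        )

        func main() {
            response, err := http.Get("https://www.coastal.edu/scs/employee")
            if err != nil {
                log.Fatal(err)
            }
            // Close the body when main returns so the connection is not leaked.
            defer response.Body.Close()

            // Tokenize straight from the body; nothing else has read it yet.
            htmlTokens := html.NewTokenizer(response.Body)
        loop:
            for {
                tt := htmlTokens.Next()
                // %T prints the type name, html.TokenType, with no newline.
                fmt.Printf("%T", tt)
                switch tt {
                case html.ErrorToken:
                    // Reached at the end of input (or on a read error).
                    fmt.Println("End")
                    break loop
                case html.TextToken:
                    fmt.Println(tt)
                case html.StartTagToken:
                    t := htmlTokens.Token()
                    isAnchor := t.Data == "a"
                    if isAnchor {
                        fmt.Println("We found an anchor!")
                    }
                }
            }
        }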
    
    The asker selected this answer as the best answer.
