dongshilve4392 2016-07-10 15:47

HTML parser ignores img tag (Golang)

My task is to find image URLs inside an HTML page.

The problem

The HTML parser golang.org/x/net/html, as well as github.com/PuerkitoBio/goquery, ignores the biggest image on the page http://www.ozon.ru/context/detail/id/34498204/

The question

  • What is wrong with my code?
  • Why is the required img tag with src="" ignored?
  • Is there a way to get all images from HTML with Go?

Notes:

  • When I used a parser written in Swift, this image was found on the page: //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg

  • This image tag is found when I use a regex search.

  • This image tag is found when I use the third-party service saveallimages.com.

  • I tried to use gokogiri but could not build it on my Mac: go get succeeds, but go build hangs forever.

Parsed HTML page source

This is the HTML returned by resp, _ := http.Get(url)

Code:

package main

import (
  "golang.org/x/net/html"
  "log"
  "net/http"
)


func main() {

  url := "http://www.ozon.ru/context/detail/id/34498204/"

  if resp, err := http.Get(url); err == nil {
    defer resp.Body.Close()

    log.Println("Load page complete")

    if resp != nil {
      log.Println("Page response is NOT nil")

      if document, err := html.Parse(resp.Body); err == nil {

        var parser func(*html.Node)
        parser = func(n *html.Node) {
          if n.Type == html.ElementNode && n.Data == "img" {

            var imgSrcUrl, imgDataOriginal string

            for _, element := range n.Attr {
              if element.Key == "src" {
                imgSrcUrl = element.Val
              }
              if element.Key == "data-original" {
                imgDataOriginal = element.Val
              }
            }

            log.Println(imgSrcUrl, imgDataOriginal)
          }

          for c := n.FirstChild; c != nil; c = c.NextSibling {
            parser(c)
          }

        }
        parser(document)
      } else {
        log.Panicln("Parse html error", err)
      }

    } else {
      log.Println("Page response IS nil")
    }
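  } else {
    // Report the request error instead of silently doing nothing.
    log.Println("Load page error", err)
  }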

}

1 answer

  • douhan1860 2016-07-11 19:10

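    This is not a bug but the expected behaviour of x/net/html, and it affects every parser built on x/net/html (goquery included). The missing image sits inside a <noscript> element, and x/net/html parses pages the way a script-enabled browser would, so the content of <noscript> is kept as plain text instead of being parsed into elements such as img.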

    There are four possible solutions:

    1. Remove the <noscript> and </noscript> tags from the HTML so that x/net/html parses their content as regular markup. Something like:

      package main
      
      import (
          "golang.org/x/net/html"
          "log"
          "net/http"
          "io/ioutil"
          "strings"
      )
      
      func main() {
      
          url := "http://www.ozon.ru/context/detail/id/34498204/"
      
          if resp, err := http.Get(url); err == nil {
              defer resp.Body.Close()
      
              log.Println("Load page complete")
      
              if resp != nil {
                  log.Println("Page response is NOT nil")
                  // --------------
                  // The body is closed by the deferred Close above.
                  data, err := ioutil.ReadAll(resp.Body)
                  if err != nil {
                      log.Panicln("Read body error", err)
                  }
      
                  hdata := strings.Replace(string(data), "<noscript>", "", -1)
                  hdata = strings.Replace(hdata, "</noscript>", "", -1)
                  // --------------
      
                  if document, err := html.Parse(strings.NewReader(hdata)); err == nil {
                      var parser func(*html.Node)
                      parser = func(n *html.Node) {
                          if n.Type == html.ElementNode && n.Data == "img" {
      
                              var imgSrcUrl, imgDataOriginal string
      
                              for _, element := range n.Attr {
                                  if element.Key == "src" {
                                      imgSrcUrl = element.Val
                                  }
                                  if element.Key == "data-original" {
                                      imgDataOriginal = element.Val
                                  }
                              }
      
                              log.Println(imgSrcUrl, imgDataOriginal)
                          }
      
                          for c := n.FirstChild; c != nil; c = c.NextSibling {
                              parser(c)
                          }
      
                      }
                      parser(document)
                  } else {
                      log.Panicln("Parse html error", err)
                  }
      
              } else {
                  log.Println("Page response IS nil")
              }
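          } else {
              // Report the request error instead of silently doing nothing.
              log.Println("Load page error", err)
          }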
      
      }
      
    2. Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec

    3. Use gokogiri with Go 1.4 (I'm pretty sure this is the last supported version).

    4. Wait for a decision on https://github.com/golang/go/issues/16318. If this is a real bug, I'll make a pull request.
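
The plain strings.Replace calls in solution 1 only match the exact lowercase tags. As a hedged variation, the sketch below removes the tags with a case-insensitive regex that also tolerates attributes. The pattern and the `unwrapNoscript` name are assumptions for illustration and are not part of x/net/html.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTag matches opening and closing noscript tags in any letter case,
// with optional attributes. The pattern is an assumption for illustration.
var noscriptTag = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// unwrapNoscript removes the noscript tags themselves and keeps everything
// between them, so a parser will later see that content as regular markup.
func unwrapNoscript(page string) string {
	return noscriptTag.ReplaceAllString(page, "")
}

func main() {
	// Hypothetical sample markup.
	page := `<p>a</p><NoScript class="x"><img src="/big.jpg"></NOSCRIPT>`
	fmt.Println(unwrapNoscript(page)) // prints: <p>a</p><img src="/big.jpg">
}
```

The result can be passed to html.Parse through strings.NewReader, exactly as hdata is in solution 1.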

    This answer was accepted by the asker as the best answer.
