dongshilve4392 2016-07-10 15:47

HTML parser ignores img tag (Golang)

My task is to find image URLs inside an HTML page.

The problem

The HTML parser golang.org/x/net/html, as well as github.com/PuerkitoBio/goquery, ignores the biggest image on the page http://www.ozon.ru/context/detail/id/34498204/

The question

  • What is wrong with my code?
  • Why is the required img tag with src="" ignored?
  • Is there a way to get all images from HTML with Go?

Notes:

  • When I used a parser written in Swift, this image was found on the page: //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg

  • This image tag is found when I use a regex search.

  • This image tag is found when I use the third-party service saveallimages.com.

  • I tried to use gokogiri but had no success compiling it on my Mac. go get succeeds, but go build hangs forever.

Parsed HTML page source

This is the HTML returned by resp, _ := http.Get(url)

Code:

package main

import (
  "log"
  "net/http"

  "golang.org/x/net/html"
)

func main() {
  url := "http://www.ozon.ru/context/detail/id/34498204/"

  resp, err := http.Get(url)
  if err != nil {
    log.Fatalln("Load page error", err)
  }
  defer resp.Body.Close()
  log.Println("Load page complete")

  document, err := html.Parse(resp.Body)
  if err != nil {
    log.Panicln("Parse html error", err)
  }

  // Walk the node tree and log src and data-original of every img element.
  var parser func(*html.Node)
  parser = func(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "img" {
      var imgSrcUrl, imgDataOriginal string

      for _, element := range n.Attr {
        if element.Key == "src" {
          imgSrcUrl = element.Val
        }
        if element.Key == "data-original" {
          imgDataOriginal = element.Val
        }
      }

      log.Println(imgSrcUrl, imgDataOriginal)
    }

    for c := n.FirstChild; c != nil; c = c.NextSibling {
      parser(c)
    }
  }
  parser(document)
}

1 answer

  • douhan1860 2016-07-11 19:10

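    This is not a bug but expected behaviour of x/net/html, and it affects all parsers based on x/net/html. The image is wrapped in a <noscript> element, and x/net/html parses as if scripting were enabled, so the content of <noscript> is kept as raw text instead of being parsed into elements such as img.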

    There are four possible solutions:

    1. Remove <noscript> and </noscript> from the HTML so that x/net/html parses their content as regular markup. Something like:

      package main

      import (
          "io"
          "log"
          "net/http"
          "strings"

          "golang.org/x/net/html"
      )

      func main() {
          url := "http://www.ozon.ru/context/detail/id/34498204/"

          resp, err := http.Get(url)
          if err != nil {
              log.Fatalln("Load page error", err)
          }
          defer resp.Body.Close()
          log.Println("Load page complete")

          // --------------
          // Read the whole body and strip the noscript tags, keeping their content.
          // io.ReadAll needs Go 1.16 or later; use ioutil.ReadAll on older versions.
          data, err := io.ReadAll(resp.Body)
          if err != nil {
              log.Panicln("Read body error", err)
          }

          hdata := strings.Replace(string(data), "<noscript>", "", -1)
          hdata = strings.Replace(hdata, "</noscript>", "", -1)
          // --------------

          document, err := html.Parse(strings.NewReader(hdata))
          if err != nil {
              log.Panicln("Parse html error", err)
          }

          // Walk the node tree and log src and data-original of every img element.
          var parser func(*html.Node)
          parser = func(n *html.Node) {
              if n.Type == html.ElementNode && n.Data == "img" {
                  var imgSrcUrl, imgDataOriginal string

                  for _, element := range n.Attr {
                      if element.Key == "src" {
                          imgSrcUrl = element.Val
                      }
                      if element.Key == "data-original" {
                          imgDataOriginal = element.Val
                      }
                  }

                  log.Println(imgSrcUrl, imgDataOriginal)
              }

              for c := n.FirstChild; c != nil; c = c.NextSibling {
                  parser(c)
              }
          }
          parser(document)
      }
      
    2. Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec

    3. Use gokogiri with Go 1.4 (I'm pretty sure this is the last version supported).

    4. Wait for a decision on https://github.com/golang/go/issues/16318. If this is a real bug, I'll make the pull request.
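
The plain string replacement in solution 1 only matches the exact lowercase tags without attributes. A slightly more tolerant variant might look like the sketch below; the pattern and the helper name stripNoscriptTags are assumptions for illustration, and it uses only the standard library so it can be tried on its own.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTagRe is an illustrative pattern (an assumption, not a parser):
// it matches opening and closing noscript tags in any letter case, with
// optional attributes.
var noscriptTagRe = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// stripNoscriptTags is a hypothetical helper that removes the noscript tags
// themselves but keeps whatever markup they wrapped.
func stripNoscriptTags(markup string) string {
	return noscriptTagRe.ReplaceAllString(markup, "")
}

func main() {
	in := `<div><NoScript class="x"><img src="cover.jpg"></NOSCRIPT></div>`
	fmt.Println(stripNoscriptTags(in))
}
```

The cleaned string can then be passed to html.Parse(strings.NewReader(...)) in place of the two strings.Replace calls in solution 1.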
