HTML parser ignores img tag (Golang)

My task is to find image URLs inside an HTML page.

The problem

The HTML parser golang.org/x/net/html, as well as github.com/PuerkitoBio/goquery, ignores the biggest image on the page http://www.ozon.ru/context/detail/id/34498204/

The question

What is wrong with my code?
Why is the required img tag with src="" ignored?
Is there a way to get all images from HTML with Go?

Notes:

When I used a parser written in Swift, this image was found on the page: //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg
The image tag is found when I use a regex search.
The image tag is found when I use the third-party service saveallimages.com.
I tried to use gokogiri but could not compile it on my Mac: go get succeeds, but go build hangs forever.
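
A minimal sketch of the kind of regex search mentioned in the notes, using only the standard library. The pattern and the helper name findImageSources are assumptions for illustration, not the code that was originally run. It is a heuristic that only handles quoted src values, not a real HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// imgSrcPattern matches a quoted src attribute inside an <img ...> tag,
// in any letter case. The \s before "src" keeps attributes such as
// data-src from matching.
var imgSrcPattern = regexp.MustCompile(`(?is)<img\b[^>]*?\ssrc\s*=\s*["']([^"']*)["']`)

// findImageSources (hypothetical helper) returns every non-empty src value
// that the pattern finds in page, in document order.
func findImageSources(page string) []string {
	var urls []string
	for _, m := range imgSrcPattern.FindAllStringSubmatch(page, -1) {
		if m[1] != "" {
			urls = append(urls, m[1])
		}
	}
	return urls
}

func main() {
	// Sample markup made up for this sketch; the second image sits inside <noscript>.
	page := `<div><img src="/a.png"><noscript><IMG class="x" SRC='//static2.ozone.ru/multimedia/spare_covers/1013531536.jpg'></noscript></div>`
	for _, u := range findImageSources(page) {
		fmt.Println(u)
	}
}
```

Because the regex works on the raw text, it does not care which element the img tag is nested in, which is presumably why it finds the tag that the parser-based code misses.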

Parsed HTML page source

This is the HTML returned by resp, _ := http.Get(url)

Code:

package main

import (
  "golang.org/x/net/html"
  "log"
  "net/http"
)


func main() {

  url := "http://www.ozon.ru/context/detail/id/34498204/"

  if resp, err := http.Get(url); err == nil {
    defer resp.Body.Close()

    log.Println("Load page complete")

    if resp != nil {
      log.Println("Page response is NOT nil")

      if document, err := html.Parse(resp.Body); err == nil {

        var parser func(*html.Node)
        parser = func(n *html.Node) {
          if n.Type == html.ElementNode && n.Data == "img" {

            var imgSrcUrl, imgDataOriginal string

            for _, element := range n.Attr {
              if element.Key == "src" {
                imgSrcUrl = element.Val
              }
              if element.Key == "data-original" {
                imgDataOriginal = element.Val
              }
            }

            log.Println(imgSrcUrl, imgDataOriginal)
          }

          for c := n.FirstChild; c != nil; c = c.NextSibling {
            parser(c)
          }

        }
        parser(document)
      } else {
        log.Panicln("Parse html error", err)
      }

    } else {
      log.Println("Page response IS nil")
    }
  }

}

1 answer

douhan1860 2016-07-11 19:10

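This is not a bug but expected behaviour of x/net/html, and it affects every parser built on x/net/html, goquery included. The missing image sits inside a <noscript> element. x/net/html parses the page as if scripting were enabled, so the content of <noscript> is kept as plain text and the <img> inside it never becomes an element node.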

There are four possible solutions:

Remove <noscript> and </noscript> from the HTML so that x/net/html parses their content as ordinary markup. Something like:

package main

import (
    "golang.org/x/net/html"
    "log"
    "net/http"
    "io/ioutil"
    "strings"
)

func main() {

    url := "http://www.ozon.ru/context/detail/id/34498204/"

    if resp, err := http.Get(url); err == nil {
        defer resp.Body.Close()

        log.Println("Load page complete")

        if resp != nil {
            log.Println("Page response is NOT nil")
            // --------------
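            // Read the whole body so the noscript tags can be stripped before parsing.
            // The deferred resp.Body.Close() above already closes the body.
            data, err := ioutil.ReadAll(resp.Body)
            if err != nil {
                log.Panicln("Read body error", err)
            }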

            hdata := strings.Replace(string(data), "<noscript>", "", -1)
            hdata = strings.Replace(hdata, "</noscript>", "", -1)
            // --------------

            if document, err := html.Parse(strings.NewReader(hdata)); err == nil {
                var parser func(*html.Node)
                parser = func(n *html.Node) {
                    if n.Type == html.ElementNode && n.Data == "img" {

                        var imgSrcUrl, imgDataOriginal string

                        for _, element := range n.Attr {
                            if element.Key == "src" {
                                imgSrcUrl = element.Val
                            }
                            if element.Key == "data-original" {
                                imgDataOriginal = element.Val
                            }
                        }

                        log.Println(imgSrcUrl, imgDataOriginal)
                    }

                    for c := n.FirstChild; c != nil; c = c.NextSibling {
                        parser(c)
                    }

                }
                parser(document)
            } else {
                log.Panicln("Parse html error", err)
            }

        } else {
            log.Println("Page response IS nil")
        }
    }

}

Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec
Use gokogiri with Go 1.4 (I'm pretty sure this is the last version it supports).
Wait for a decision on https://github.com/golang/go/issues/16318. If this turns out to be a real bug, I'll make a pull request.
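
One caveat for the first solution: strings.Replace only matches the exact lowercase strings <noscript> and </noscript>. A hedged sketch of a more tolerant variant, using the standard regexp package, is shown here; the helper name unwrapNoscript and the pattern are assumptions, not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTag matches opening and closing noscript tags in any letter case,
// including opening tags that carry attributes.
var noscriptTag = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// unwrapNoscript (hypothetical helper) deletes the noscript tags themselves
// and keeps everything that was between them.
func unwrapNoscript(page string) string {
	return noscriptTag.ReplaceAllString(page, "")
}

func main() {
	// Sample markup made up for this sketch.
	page := `<div><NoScript class="fallback"><img src="/cover.jpg"></NOSCRIPT></div>`
	fmt.Println(unwrapNoscript(page))
}
```

The returned string can be passed to html.Parse through strings.NewReader, exactly as hdata is in the first solution.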

The asker selected this answer as the best answer.
