Problem parsing HTML with BeautifulSoup or golang colly

For the record, I have written quite a few scrapers successfully in both frameworks, but I'm stumped. Here is a screenshot of the data I'm trying to scrape (you can also go to the actual link in the GET request):

I attempt to target the div.section_content:

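import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")
soup.find_all("div", {"class": "section_content"})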

Printing the last line shows some other divs, but not the one with the pitching data.

However, I can see it's in the text, so it's not a JavaScript-triggered loading problem (the phrase "Pitching" only comes up in that table):

>>> "Pitching" in soup.text
True

Here is an abbreviated version of one of the golang attempts:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
) 

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.baseball-reference.com"),
    )
    // Print the text of every div.section_content found inside a table wrapper.
    c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("div.section_content"))
    })
    c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
}

1 answer

dongpan9760 2018-07-11 23:32
It looks to me like the HTML is actually commented out, so that's why BeautifulSoup can't find it. Either remove the comment markers from the HTML string before you parse it or use BeautifulSoup to extract the comments and parse the return value.

For example:

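from bs4 import BeautifulSoup, Comment

# Each HTML comment is a Comment node; re-parse its text as HTML.
for element in soup(string=lambda text: isinstance(text, Comment)):
    comment = element.extract()
    comment_soup = BeautifulSoup(comment, "html.parser")
    # work with comment_soup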
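
The other option mentioned above, removing the comment markers before parsing, could look roughly like the sketch below. It is only an illustration: SAMPLE_HTML is a made-up stand-in for the fetched page, and strip_comment_markers and find_section_contents are hypothetical helper names, not part of BeautifulSoup. Note that stripping markers blindly also revives any markup the site genuinely meant to comment out.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: the div we want
# is wrapped in an HTML comment, as the answer above suspects.
SAMPLE_HTML = """
<div class="table_wrapper">
  <div class="section_content">visible</div>
  <!--
  <div class="section_content"><table><caption>Pitching</caption></table></div>
  -->
</div>
"""


def strip_comment_markers(html: str) -> str:
    """Drop the comment delimiters so the commented-out markup becomes real markup."""
    return html.replace("<!--", "").replace("-->", "")


def find_section_contents(html: str):
    """Parse the uncommented HTML and return every div.section_content."""
    soup = BeautifulSoup(strip_comment_markers(html), "html.parser")
    return soup.find_all("div", class_="section_content")


if __name__ == "__main__":
    for div in find_section_contents(SAMPLE_HTML):
        print(div.get_text(strip=True))
```

With the real page, the same helper would be called on the `html` string returned by requests instead of SAMPLE_HTML.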
This answer was selected by the asker as the best answer.
