drk7700 2018-07-12 07:23

Problem parsing HTML with BeautifulSoup or golang colly

For the record, I have written quite a few scrapers successfully in both frameworks, but this one has me stumped. Here is a screenshot of the data I'm trying to scrape (you can also open the URL used in the GET request below):

[Screenshot: the pitching data table on the box score page]

I attempt to target the div.section_content:

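import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")
soup.find_all("div", {"class": "section_content"})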

Printing the last line shows some other divs, but not the one with the pitching data.

However, I can see it's in the text, so it's not a JavaScript-triggered loading problem (the word "Pitching" only appears in that table):

>>> "Pitching" in soup.text
True

Here is an abbreviated version of one of the golang attempts:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
) 

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.baseball-reference.com"),
    )
    // Print the text of the div.section_content elements inside each div.table_wrapper.
    c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("div.section_content"))
    })
    c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
}

1 answer

  • dongpan9760 2018-07-12 07:32

    It looks to me like the HTML is actually commented out, which is why BeautifulSoup can't find it. Either remove the comment markers from the HTML string before you parse it, or use BeautifulSoup to extract the comments and parse each one as its own document.

    For example:

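    from bs4 import BeautifulSoup, Comment

    # Calling soup(...) is shorthand for soup.find_all(...); this matches only Comment nodes.
    for element in soup(string=lambda text: isinstance(text, Comment)):
        comment = element.extract()  # detach the comment from the original tree
        comment_soup = BeautifulSoup(comment, "html.parser")
        # work with comment_soup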
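Both options mentioned above can be tried without hitting the site. The following is a minimal sketch, not a definitive implementation: `SAMPLE_HTML` is a made-up stand-in for the real page, and the helper names `find_in_comments` and `strip_comment_markers` are hypothetical, introduced here only for illustration.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical sample standing in for the real page: one div is visible,
# the other is wrapped in an HTML comment.
SAMPLE_HTML = """
<div class="table_wrapper">
  <div class="section_content">Visible content</div>
  <!--
  <div class="section_content"><table><caption>Pitching</caption></table></div>
  -->
</div>
"""


def find_in_comments(html, name, class_name):
    """Option 1: parse each comment as its own document and search inside it."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment_soup = BeautifulSoup(comment, "html.parser")
        found.extend(comment_soup.find_all(name, class_=class_name))
    return found


def strip_comment_markers(html):
    """Option 2: drop the <!-- and --> markers so the markup parses normally."""
    return re.sub(r"<!--|-->", "", html)


if __name__ == "__main__":
    plain = BeautifulSoup(SAMPLE_HTML, "html.parser")
    print(len(plain.find_all("div", class_="section_content")))
    print(len(find_in_comments(SAMPLE_HTML, "div", "section_content")))
    uncommented = BeautifulSoup(strip_comment_markers(SAMPLE_HTML), "html.parser")
    print(len(uncommented.find_all("div", class_="section_content")))
```

On this sample, the plain parse sees only the visible div, the comment search returns the hidden one, and the parse after stripping the markers sees both. Stripping markers with a regular expression is the blunter option, since it also exposes any comment that was never meant to be markup, so searching inside the comments is probably the safer default.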
    
    This answer was selected by the asker as the best answer.
