2018-07-12 07:23
Views: 199

Trouble parsing HTML with BeautifulSoup or golang colly

For the record, I have written quite a few scrapers successfully in both frameworks, but this one has me stumped. Here is a screenshot of the data I'm trying to scrape (you can also open the actual link used in the GET request below):

[Screenshot: the pitching table on the page]

I attempt to target the div.section_content:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
soup.find_all("div", {"class": "section_content"})

Printing the result of the last line shows some other divs, but not the one with the pitching data.

However, I can see that the data is in the page text, so this is not a JavaScript-triggered loading problem (the word "Pitching" only appears in that table):

>>> "Pitching" in soup.text
True

Here is an abbreviated version of one of the golang attempts:

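package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
        // The handler body was omitted in the post; printing the text is a placeholder.
        fmt.Println(e.Text)
    })
    c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
}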


1 answer

  • dongpan9760 2018-07-12 07:32

    It looks to me like that part of the HTML is commented out, which is why BeautifulSoup can't find it: the parser treats everything inside a comment as a single string, not as tags. Either remove the comment markers from the HTML string before you parse it, or use BeautifulSoup to extract the comments and parse each one separately.

    For example:

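    from bs4 import BeautifulSoup, Comment

    # Comment nodes hold the hidden markup as plain strings
    for element in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment = element.extract()
        comment_soup = BeautifulSoup(comment, "html.parser")
        # work with comment_soup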
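
The other option mentioned above, removing the comment markers before parsing, could look roughly like the sketch below. The `SAMPLE_HTML` string and the `uncomment` helper are made up for illustration (they stand in for the downloaded page and are not part of bs4), and the regex is deliberately blunt: it un-comments every comment in the document, including genuine ones, so treat it as a starting point only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: one visible div,
# plus a second div that is wrapped in an HTML comment.
SAMPLE_HTML = """
<div class="section_content">visible</div>
<!--
<div class="section_content" id="pitching"><table><caption>Pitching</caption></table></div>
-->
"""

def uncomment(html: str) -> str:
    """Drop the <!-- and --> markers so commented-out markup becomes live markup."""
    return re.sub(r"<!--|-->", "", html)

before = BeautifulSoup(SAMPLE_HTML, "html.parser")
after = BeautifulSoup(uncomment(SAMPLE_HTML), "html.parser")

# The commented-out div only becomes findable after the markers are stripped.
print(len(before.find_all("div", {"class": "section_content"})))
print(len(after.find_all("div", {"class": "section_content"})))
```

With the real page you would pass `uncomment(html)` to `BeautifulSoup` in place of `html`; the same pre-processing should also work before handing the document to any other parser.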