JK_laile 2020-11-26 21:27 Acceptance rate: 40%

Beginner here: code typed in from Python网络爬虫权威指南 (Web Scraping with Python) fails, could someone take a look?

# -*- coding: GBK -*-
from bs4 import BeautifulSoup

class Website:
	
	def __init__(self,name,url,targetPattern,absoluteUrl,
		titleTag,bodyTag):
		self.name = name
		self.url = url
		self.targetPattren = targetPattern
		self.absoluteUrl = absoluteUrl
		self.titleTag = titleTag
		self.bodyTag = bodyTag
		
class Content:
	def __init__(self,url,title,body):
		self.url = url
		self.title = title
		self.body = body
		
	def print(self):
		print("URL: {}".format(self.url))
		print("TITLE: {}".format(self.title))
		print("BODY: {}".format(self.body))
		
import re
import requests

class Crawler:
	def __init__(self,site):
		self.site = site
		self.visited = []
		
	def getPage(self,url):
		try:
			req = requests.get(url)
		except requests.exceptions.RequestException:
			return None
		return BeautifulSoup(req.text, 'html.parser')
		
	def safeGet(self,pageObj,selector):
		selectedElems = pageObj.select(selector)
		if selectedElems is not None and len(selectedElems) > 0:
			return '\n'.join([elem.get_text() for elem in selectedElems])
		return ''
		
	def parse(self,url):
		bs = self.getPage(url)
		if bs is not None:
			title = self.safeGet(bs,self.site.titleTag)
			body = self.safeGet(bs,self.site.bodyTag)
			if title != '' and body != '':
				content = Content(url,title,body)
				content.print()

	def crawl(self):
		"""Get the page links from the site's home page"""
		
		bs = self.getPage(self.site.url)
		targetPages = bs.findALL('a',href=re.compile(self.site.targetPattern))
		for targetPage in targetPages:
			targetPage = targetPate.attrs['href']
			if targetPage not in self.visited:
				self.visited.append(targetPage)
				if not self.site.absolutedUrl:
					targetPage = '{}{}'.format(self.site.url,targetPage)
				self.parse(targetPage)
				
reuters = Website('Reuters', 'https://www.reuters.com', '^(/artilce/)', False,
	'h1', 'div.StandardArticleBody_body_1gnLA')

crawler = Crawler(reuters)
crawler.crawl()

The code is above, typed in exactly as it appears in the book. This is what I get when I run it:

 

findALL is how the book writes it. I also tried changing it to find_all and findall, but neither helped and I still get the same error.
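The screenshot is not preserved, so the following is only a guess from the code itself. One possible reason the error stays the same whichever spelling is used: Python evaluates the arguments of a call before making the call, and `self.site.targetPattern` does not exist because `__init__` stores the value as `self.targetPattren`. A minimal sketch of that effect, using a hypothetical `FakeSoup` stand-in (not the real bs4 class) that returns `None` for unknown attribute names:

```python
import re


class Site:
    def __init__(self, target_pattern):
        # Deliberate typo, mirroring `self.targetPattren` in the question.
        self.targetPattren = target_pattern


class FakeSoup:
    """Hypothetical stand-in for a parsed page; NOT the real bs4 class."""

    def __getattr__(self, name):
        # Only reached for names that are not defined on the class.
        if name.startswith('__'):
            raise AttributeError(name)
        # Unknown names quietly resolve to None instead of raising.
        return None

    def find_all(self, tag, href=None):
        return []


def try_call(method_name):
    """Attempt the call from crawl() and report which exception appears."""
    site = Site('^(/article/)')
    soup = FakeSoup()
    try:
        method = getattr(soup, method_name)
        # The argument expression is evaluated before the call happens,
        # so the misspelled attribute fails first for every method name.
        return method('a', href=re.compile(site.targetPattern))
    except (AttributeError, TypeError) as exc:
        return '{}: {}'.format(type(exc).__name__, exc)


for name in ('findALL', 'find_all', 'findall'):
    print(name, '->', try_call(name))
```

In this sketch all three spellings report an AttributeError about `targetPattern`, which would match the "same error every time" symptom if that is what the screenshot showed.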

2 answers

  • 考古学家lx(李玺) (featured Python creator) 2020-11-27 10:12

    findAll = find_all # BS3

    findChildren = find_all # BS2

    The website has probably been updated since the book was written.

    The asker selected this answer as the best answer.
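For anyone hitting the same problem, here is a hedged sketch of the link-collecting step from `crawl()` with the apparent typos in the posted code corrected: `targetPattren` should be `targetPattern`, `findALL` should be `find_all`, `targetPate` should be `targetPage`, `absolutedUrl` should be `absoluteUrl`, and `'^(/artilce/)'` should be `'^(/article/)'`. The function name `collect_targets` and the `SAMPLE_HTML` markup are assumptions made up for illustration so the example runs without network access. Whether the live site still matches the book's pattern and the `div.StandardArticleBody_body_1gnLA` selector is a separate question, as the answer above suggests.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded home page.
SAMPLE_HTML = """
<html><body>
  <a href="/article/first-story">First</a>
  <a href="/article/first-story">First again</a>
  <a href="/video/clip">Clip</a>
  <a href="/article/second-story">Second</a>
</body></html>
"""


def collect_targets(html, base_url, target_pattern, absolute_url=False):
    """Return the de-duplicated links whose href matches target_pattern."""
    bs = BeautifulSoup(html, 'html.parser')
    visited = []
    # find_all is the bs4 spelling; findALL (capital L) is not a method.
    for link in bs.find_all('a', href=re.compile(target_pattern)):
        target_page = link.attrs['href']
        if target_page not in visited:
            visited.append(target_page)
    if absolute_url:
        return visited
    # Relative links are prefixed with the site URL, as in the original code.
    return ['{}{}'.format(base_url, page) for page in visited]


pages = collect_targets(SAMPLE_HTML, 'https://www.reuters.com', '^(/article/)')
for page in pages:
    print(page)
```

In the real crawler, `getPage` can return `None` when the request fails, so `crawl()` should check `bs is not None` before calling `find_all` on it.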
