JK_laile 2020-11-26 21:27 Acceptance rate: 40%
Views: 41
Answer accepted

Beginner typing in code from the book Python网络爬虫权威指南 and getting an error; could someone take a look?

# -*- coding: GBK -*-
from bs4 import BeautifulSoup

class Website:
	
	def __init__(self,name,url,targetPattern,absoluteUrl,
		titleTag,bodyTag):
		self.name = name
		self.url = url
		self.targetPattren = targetPattern
		self.absoluteUrl = absoluteUrl
		self.titleTag = titleTag
		self.bodyTag = bodyTag
		
class Content:
	def __init__(self,url,title,body):
		self.url = url
		self.title = title
		self.body = body
		
	def print(self):
		print("URL: {}".format(self.url))
		print("TITLE: {}".format(self.title))
		print("BODY: {}".format(self.body))
		
import re
import requests

class Crawler:
	def __init__(self,site):
		self.site = site
		self.visited = []
		
	def getPage(self,url):
		try:
			req = requests.get(url)
		except requests.exceptions.RequestException:
			return None
		return BeautifulSoup(req.text, 'html.parser')
		
	def safeGet(self,pageObj,selector):
		selectedElems = pageObj.select(selector)
		if selectedElems is not None and len(selectedElems) > 0:
			return '\n'.join([elem.get_text() for elem in selectedElems])
		return ''
		
	def parse(self,url):
		bs = self.getPage(url)
		if bs is not None:
			title = self.safeGet(bs,self.site.titleTag)
			body = self.safeGet(bs,self.site.bodyTag)
			if title != '' and body != '':
				content = Content(url,title,body)
				content.print()

	def crawl(self):
		"""Get the page links from the site's home page."""
		
		bs = self.getPage(self.site.url)
		targetPages = bs.findALL('a',href=re.compile(self.site.targetPattern))
		for targetPage in targetPages:
			targetPage = targetPate.attrs['href']
			if targetPage not in self.visited:
				self.visited.append(targetPage)
				if not self.site.absolutedUrl:
					targetPage = '{}{}'.format(self.site.url,targetPage)
				self.parse(targetPage)
				
reuters = Website('Reuters', 'https://www.reuters.com', '^(/artilce/)', False,
	'h1', 'div.StandardArticleBody_body_1gnLA')

crawler = Crawler(reuters)
crawler.crawl()

The code is above, typed in as it appears in the book. Running it gives this:

 

findALL is how the book writes it. I also tried changing it to find_all and findall, but neither helped; the same error is still reported.


2 answers

  • 考古学家lx(李玺), high-quality creator in the Python field, 2020-11-27 10:12

    findAll = find_all # BS3

    findChildren = find_all # BS2

    The website has probably been updated.
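The two alias lines quoted in this answer can be checked directly. The snippet below is a small sketch, using a made-up HTML string, of how `find_all`, `findAll`, and a misspelled `findALL` behave in bs4. It may also explain why renaming the method did not change the error: an unknown attribute on a BeautifulSoup object is treated as a tag-name lookup rather than a method, and the arguments of the call (including `self.site.targetPattern`) are still evaluated before anything is called. Recent bs4 releases may emit a deprecation warning for `findAll`.

```python
from bs4 import BeautifulSoup

# Made-up markup, only for illustration.
html = '<a href="/article/one">One</a><a href="/about">About</a>'
bs = BeautifulSoup(html, 'html.parser')

# find_all is the bs4 name; findAll is kept as an alias for older code.
links_new = [a['href'] for a in bs.find_all('a')]
links_old = [a['href'] for a in bs.findAll('a')]

# findALL (capital L L) is not a method. bs4 treats unknown attribute
# names as tag names, so this is a search for a <findALL> tag, which
# does not exist in this document.
misspelled = bs.findALL

print(links_new)
print(links_old)
print(misspelled)
```

So with `findALL` the call would eventually fail because the looked-up value is not callable, but only after the argument expressions have been evaluated.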

    This answer was selected as the best answer by the asker.
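Separately from the method name, the posted code contains several misspelled identifiers: `targetPattren` is assigned but `targetPattern` is read, `targetPate` is used instead of `targetPage`, `absolutedUrl` is read instead of `absoluteUrl`, and the pattern says `/artilce/` instead of `/article/`. Since the screenshot of the error is not available, which of these produced the reported message is a guess. Below is a sketch of the crawler with those names made consistent. The `fetch` parameter, the `fetchHtml` helper, the `results` list, and the in-memory `pages` under `https://example.test` are hypothetical additions so the logic can be tried without network access; they are not from the book.

```python
import re
from bs4 import BeautifulSoup


class Website:
    def __init__(self, name, url, targetPattern, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.targetPattern = targetPattern  # was misspelled targetPattren
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY: {}".format(self.body))


def fetchHtml(url):
    """Hypothetical default fetcher: page text, or None on a request error."""
    import requests  # imported here so the offline demo below needs only bs4
    try:
        resp = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        return None
    return resp.text


class Crawler:
    def __init__(self, site, fetch=fetchHtml):
        self.site = site
        self.fetch = fetch   # hypothetical hook for swapping in test pages
        self.visited = []
        self.results = []    # hypothetical: keeps parsed Content objects

    def getPage(self, url):
        html = self.fetch(url)
        if html is None:
            return None
        return BeautifulSoup(html, 'html.parser')

    def safeGet(self, pageObj, selector):
        selectedElems = pageObj.select(selector)
        if selectedElems:
            return '\n'.join(elem.get_text() for elem in selectedElems)
        return ''

    def parse(self, url):
        bs = self.getPage(url)
        if bs is None:
            return
        title = self.safeGet(bs, self.site.titleTag)
        body = self.safeGet(bs, self.site.bodyTag)
        if title != '' and body != '':
            content = Content(url, title, body)
            content.print()
            self.results.append(content)

    def crawl(self):
        """Get the page links from the site's home page."""
        bs = self.getPage(self.site.url)
        if bs is None:  # the home page could not be fetched
            return
        targetPages = bs.find_all('a', href=re.compile(self.site.targetPattern))
        for link in targetPages:
            targetPage = link.attrs['href']  # was targetPate
            if targetPage not in self.visited:
                self.visited.append(targetPage)
                if not self.site.absoluteUrl:  # was absolutedUrl
                    targetPage = '{}{}'.format(self.site.url, targetPage)
                self.parse(targetPage)


# Offline demo with made-up pages instead of a live site.
pages = {
    'https://example.test':
        '<a href="/article/a1">A1</a><a href="/about">About</a>'
        '<a href="/article/a1">A1 again</a>',
    'https://example.test/article/a1':
        '<h1>Title A1</h1><div class="body">Text A1</div>',
}
site = Website('Example', 'https://example.test', '^(/article/)', False,
               'h1', 'div.body')
crawler = Crawler(site, fetch=pages.get)
crawler.crawl()
```

To run it against a real site, construct `Crawler(site)` without the `fetch` argument. As the answer above suggests, the site's markup may have changed since the book was written, so a selector such as `div.StandardArticleBody_body_1gnLA` may match nothing, in which case `parse` prints nothing even when the code itself is correct.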