I am crawling news articles from the tech section of 凤凰网 (ifeng.com). Example article: https://tech.ifeng.com/c/7y9hTyzHqvm
The code is as follows:
spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# Phm0519Item is defined in the project's items.py; its import is not shown in this post.

class FenghuangkejiSpider(CrawlSpider):
    name = 'phm0618'
    allowed_domains = ['tech.ifeng.com']
    start_urls = ['https://tech.ifeng.com/']
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',  # log level; DEBUG is the default and the lowest level
        'ROBOTSTXT_OBEY': False,  # do not obey robots.txt rules
        'DOWNLOAD_DELAY': 0,  # download delay in seconds; the default is 0
        'COOKIES_ENABLED': False,  # enabled by default; needed when crawling data behind a login
        'DOWNLOAD_TIMEOUT': 25,  # download timeout; set globally here, or per request via Request.meta['download_timeout']
        'RETRY_TIMES': 8,
    }
    rules = (
        # Follow every link whose URL contains /c/ and parse it as an article.
        Rule(LinkExtractor(allow=r'.*/c/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        head = response.xpath('//*[@id="root"]/div/div[3]/div[1]/div[1]/div[1]/h1/text()').extract()
        content = response.xpath('//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p/text()').extract()
        url = response.url
        print(url)
        item = Phm0519Item(head=head, content=content)
        yield item
When I run it, the spider scrapes 176 items and then reports finished. I really can't figure out why; any help would be appreciated.
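
To narrow this down, it may help to check which links the spider is allowed to follow at all. Below is a minimal sketch using only the standard library that approximates (it does not reproduce) the two filters in play: the `allow` regex from the Rule and the `allowed_domains` check. The helper name `would_follow` and the second and third sample URLs are made up for illustration; only the first URL is from this post.

```python
import re
from urllib.parse import urlparse

ALLOW_PATTERN = re.compile(r'.*/c/.*')   # same regex as the Rule in the spider
ALLOWED_DOMAINS = ['tech.ifeng.com']     # same value as allowed_domains


def would_follow(url):
    """Rough approximation: the URL must match the allow regex and
    its host must be an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ''
    in_domain = any(host == d or host.endswith('.' + d) for d in ALLOWED_DOMAINS)
    return bool(ALLOW_PATTERN.search(url)) and in_domain


# Hypothetical sample links, apart from the first one.
samples = [
    'https://tech.ifeng.com/c/7y9hTyzHqvm',  # article on the allowed domain
    'https://tech.ifeng.com/',               # no /c/ in the path
    'https://news.ifeng.com/c/abc123',       # /c/ path, but a different subdomain
]
results = {url: would_follow(url) for url in samples}
for url, ok in results.items():
    print(ok, url)
```

This does not identify the cause by itself. It only shows that a link is dropped if either filter rejects it, so comparing the links actually present in the downloaded HTML against these two conditions is one place to start.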