To meet the requirements above, here is one possible Scrapy implementation: it collects every poem link from the Three Hundred Tang Poems (唐诗三百首) index page on gushiwen.cn, then follows each link to scrape the full text of each poem.
Step 1: Create the project
scrapy startproject gushiwen
Step 2: Generate the spider
cd gushiwen
scrapy genspider tangshi gushiwen.cn
(genspider expects a bare domain rather than a full URL; the actual start URL is set in the spider file in step 5.)
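After these two commands the project layout should look roughly like this (exact files may vary slightly across Scrapy versions):

gushiwen/
    scrapy.cfg
    gushiwen/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tangshi.py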
Step 3: Edit items.py
Define the data structure in gushiwen/items.py:
import scrapy


class GushiwenItem(scrapy.Item):
    title = scrapy.Field()    # poem title
    author = scrapy.Field()   # poet's name
    content = scrapy.Field()  # poem body text
    url = scrapy.Field()      # source page URL
Step 4: Edit settings.py
Make sure ROBOTSTXT_OBEY is set to False and enable the item pipeline:
ROBOTSTXT_OBEY = False

# Enable or disable pipelines
ITEM_PIPELINES = {
    'gushiwen.pipelines.GushiwenPipeline': 300,
}
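Optionally, a couple of politeness settings make the crawl less likely to be blocked. These are standard Scrapy settings; the values below are only illustrative:

# Identify as a browser instead of the default Scrapy user agent (value is illustrative)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
# Wait between requests so the site is not hammered
DOWNLOAD_DELAY = 1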
Step 5: Edit the tangshi.py spider file
import scrapy

from gushiwen.items import GushiwenItem


class TangshiSpider(scrapy.Spider):
    name = "tangshi"
    allowed_domains = ["gushiwen.cn"]
    start_urls = [
        "https://so.gushiwen.cn/gushi/tangshi.aspx",
    ]

    def parse(self, response):
        # Each poem link sits inside a div with class "sons"; use the
        # descendant axis since the <a> tags are nested a few levels deep
        poem_links = response.xpath('//div[@class="sons"]//a/@href').getall()
        for link in poem_links:
            # The links on the page are relative, so resolve them
            # against the current response URL before requesting
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_poem)

    def parse_poem(self, response):
        item = GushiwenItem()
        item['title'] = response.xpath('//h1/text()').get()
        item['author'] = response.xpath('//p[@class="source"]/a[1]/text()').get()
        # getall() returns a list of text fragments, roughly one per line of the poem
        item['content'] = response.xpath('//div[@class="contson"]/text()').getall()
        item['url'] = response.url
        yield item
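The XPaths above assume the page markup at the time of writing; the site's HTML changes occasionally, so it is worth verifying them interactively with Scrapy's shell before a full crawl:

scrapy shell "https://so.gushiwen.cn/gushi/tangshi.aspx"
>>> response.xpath('//div[@class="sons"]//a/@href').getall()[:5]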
Step 6: Edit pipelines.py
import json


class GushiwenPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open('poems.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Serialize each poem as one JSON object per line;
        # ensure_ascii=False keeps the Chinese text readable
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
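Note that this pipeline writes JSON Lines (one JSON object per line) rather than a single JSON array. A minimal sketch of reading the results back:

import json

with open('poems.json', encoding='utf-8') as f:
    poems = [json.loads(line) for line in f]
print(len(poems))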
Run the spider
scrapy crawl tangshi
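Alternatively, the custom pipeline can be dropped entirely in favor of Scrapy's built-in feed exports; in newer Scrapy versions (2.4+), the -O flag overwrites the output file on each run:

scrapy crawl tangshi -O poems.json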
This scrapes every poem link from the Three Hundred Tang Poems index page, follows each one to collect the title, author, body text, and source URL, and saves the results to poems.json.