and_ylu
2017-10-08 10:46 · viewed 3.4k

How to crawl across multiple pages in Scrapy

The goal is to crawl every page from http://download.kaoyan.com/list-1 through http://download.kaoyan.com/list-1500, where each list page is itself paginated from list-1p1 to list-1p20. At the moment the spider only crawls the list-1pN sub-pages. How should the code be changed so that it moves on to list-6? (Note that list-2 returns a 404.)

# -*- coding: utf-8 -*-
import scrapy
from Kaoyan.items import KaoyanItem

class KaoyanbangSpider(scrapy.Spider):
    name = "Kaoyanbang"
    allowed_domains = ['kaoyan.com']
    baseurl = 'http://download.kaoyan.com/list-'
    linkuseurl = 'http://download.kaoyan.com'
    offset = 1
    pset = 1

    start_urls = [baseurl+str(offset)+'p'+str(pset)]

    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        node_list = response.xpath('//table/tr/th/span/a')
        for node in node_list:
            item = KaoyanItem()
            item['name'] = node.xpath('./text()').extract()[0].encode('utf-8')
            item['link'] = (self.linkuseurl + node.xpath('./@href').extract()[0]).encode('utf-8')
            yield item

        while self.offset < 1500:
            while self.pset < 50:
                self.pset = self.pset + 1
                url = self.baseurl+str(self.offset)+'p'+str(self.pset)
                y = scrapy.Request(url, callback=self.parse)
                yield y
            self.offset = self.offset + 5
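One likely cause of the problem above: the nested `while` loops mutate the shared instance attributes `self.offset` and `self.pset`, and since `parse()` runs once per response, the counters are exhausted by the first response and later responses schedule nothing. A more robust pattern is to derive the next URLs from the URL of the current response instead of from shared state. Here is a minimal sketch of that idea as a plain helper function (the `MAX_PAGE = 20` limit and the `list-<N>p<M>` URL scheme are taken from the question; the function name `next_urls` is my own):

```python
import re

BASE = 'http://download.kaoyan.com/list-'
MAX_LIST = 1500   # crawl list-1 .. list-1500
MAX_PAGE = 20     # each list has sub-pages p1 .. p20

def next_urls(url):
    """Given a list-page URL, return the URLs to schedule next:
    the following sub-page of the same list, and (from page 1 only)
    page 1 of the next list, so each list is entered exactly once."""
    m = re.match(r'.*list-(\d+)p(\d+)$', url)
    if not m:
        return []
    n, p = int(m.group(1)), int(m.group(2))
    urls = []
    if p < MAX_PAGE:
        urls.append('%s%dp%d' % (BASE, n, p + 1))
    if p == 1 and n < MAX_LIST:
        urls.append('%s%dp%d' % (BASE, n + 1, 1))
    return urls
```

Inside `parse()` this would replace the `while` loops with something like `for u in next_urls(response.url): yield scrapy.Request(u, callback=self.parse)`. Because `handle_httpstatus_list` already lets 404 responses (such as list-2) reach `parse()`, the spider still advances to list-3 and beyond even when a list is missing.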