qq_40006118 2022-01-04 16:58

Scrapy crawl stops halfway through and reports "invalid session id"

I built a crawler with Scrapy that crawls three sites at the same time, four to five thousand records in total, and I have run it three times. Every time, one of the sites suddenly stops after the other two have finished and reports this error:

selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id

The place where it stops is different on every run. Posts online say this error appears when the webdriver is called again after it has been closed, but it doesn't feel like a code problem to me; if it were, the crawl shouldn't behave differently on every run.
Hoping someone can explain this.
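
For reference, the failure mode those posts describe is easy to reproduce in isolation (a minimal sketch, assuming chromedriver is installed and on PATH):

from selenium import webdriver
from selenium.common.exceptions import InvalidSessionIdException

driver = webdriver.Chrome()
driver.close()  # closing the last window terminates the session; chromedriver itself keeps running
try:
    driver.get('https://example.com')  # any further command runs against the dead session
except InvalidSessionIdException as e:
    print(e)  # Message: invalid session id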
The middleware file is attached below:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from selenium import webdriver
import time
import scrapy
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class JobHuntingSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class JobHuntingDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # Note: because of @classmethod, ``cls`` here is the class itself, so the
    # three drivers are class attributes shared by every instance, and
    # __del__ closes those shared drivers whenever any instance is finalized.
    @classmethod
    def __init__(cls):
        cls.BUPT_driver = webdriver.Chrome()
        cls.UESTC_driver = webdriver.Chrome()
        cls.XIDIAN_driver = webdriver.Chrome()

    @classmethod
    def __del__(cls):
        cls.BUPT_driver.close()
        cls.UESTC_driver.close()
        cls.XIDIAN_driver.close()

    @classmethod
    def get_BUPT_driver(cls):
        return cls.BUPT_driver

    @classmethod
    def get_UESTC_driver(cls):
        return cls.UESTC_driver

    @classmethod
    def get_XIDIAN_driver(cls):
        return cls.XIDIAN_driver

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        if spider.name == "UESTC":
            self.UESTC_driver.get(request.url)
            time.sleep(2)
            return scrapy.http.HtmlResponse(url=request.url,
                                            body=self.UESTC_driver.page_source.encode('utf-8'),
                                            encoding='utf-8', request=request, status=200)
        elif spider.name == "XIDIAN":
            self.XIDIAN_driver.get(request.url)
            time.sleep(2)
            return scrapy.http.HtmlResponse(url=request.url,
                                            body=self.XIDIAN_driver.page_source.encode('utf-8'),
                                            encoding='utf-8', request=request, status=200)
        elif spider.name == "BUPT":
            self.BUPT_driver.get(request.url)
            time.sleep(2)
            return scrapy.http.HtmlResponse(url=request.url,
                                            body=self.BUPT_driver.page_source.encode('utf-8'),
                                            encoding='utf-8', request=request, status=200)

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


And here is the code of the spider file where the problem occurs:

import scrapy
from job_hunting.items import JobHuntingItem
from job_hunting.middlewares import JobHuntingDownloaderMiddleware
import time
from datetime import datetime

class mySpider(scrapy.spiders.Spider):
    name = "XIDIAN"
    # 'allowed_domains' is the attribute Scrapy actually reads; 'allow_domains' is silently ignored
    allowed_domains = ['job.xidian.edu.cn']
    start_urls = ["https://job.xidian.edu.cn/campus/index?domain=xidian&city=&page=1"]

    xidian_next_page = ''

    def parse(self, response):
        item = JobHuntingItem()
        next_page_href = response.css('li[class="next"]>a::attr(href)').extract()
        last_page_href = response.css('li[class="last"]>a::attr(href)').extract()
        if next_page_href != last_page_href:
            self.xidian_next_page = 'https://job.xidian.edu.cn' + next_page_href[0]
        else:
            self.xidian_next_page = ''
        c_page_url_list = response.css('ul[class="infoList"]>li:nth-child(1)>a')
        for job in c_page_url_list:
            driver = JobHuntingDownloaderMiddleware.get_XIDIAN_driver()
            driver.get('https://job.xidian.edu.cn' + job.css('a::attr(href)').extract()[0])
            time.sleep(2)
            item['job_title'] = [driver.find_element('css selector', 'div[class="info-left"]>div>h5').text]
            date_text = driver.find_element('css selector', 'div[class="share"]>ul>li:nth-child(1)').text
            date_text = date_text[date_text.find(':') + 1:]
            if datetime.strptime(date_text, '%Y-%m-%d %H:%M') < datetime.strptime('2021-09-01 00:00', '%Y-%m-%d %H:%M'):
                self.xidian_next_page = ''
                break
            item['job_date'] = [date_text]
            views_text = driver.find_element('css selector', 'div[class="share"]>ul>li:nth-child(2)').text
            item['job_views'] = [views_text[views_text.find(':') + 1:]]
            item['job_number'] = ['None']
            yield item
        if self.xidian_next_page != '':
            yield scrapy.Request(self.xidian_next_page, callback=self.parse)


1 answer

  • 晴泪 2022-01-06 18:53

    This blogger ran into a situation somewhat similar to yours; you can use it for reference: https://blog.csdn.net/weixin_35757704/article/details/120706276
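
    In case it helps: in the posted middleware, __init__ and __del__ are decorated with @classmethod, so the three drivers are class attributes shared across instances, and __del__ closes them as soon as any middleware instance is finalized, even while another spider is still using its driver. That would match an "invalid session id" that only appears after the other crawls finish. Below is a minimal sketch of the alternative, one driver per crawl tied to the spider's lifecycle via Scrapy signals (the name SeleniumDownloaderMiddleware is made up for illustration, and this reading of the cause is an assumption, not something the linked post confirms):

    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    import time


    class SeleniumDownloaderMiddleware:
        # Hypothetical sketch: the driver is owned by this instance and lives
        # exactly as long as the spider, so no other crawl's teardown can
        # invalidate the session.

        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
            return s

        def spider_opened(self, spider):
            self.driver = webdriver.Chrome()

        def spider_closed(self, spider):
            self.driver.quit()  # quit(), not close(): ends the session and stops chromedriver cleanly

        def process_request(self, request, spider):
            self.driver.get(request.url)
            time.sleep(2)
            return HtmlResponse(url=request.url,
                                body=self.driver.page_source.encode('utf-8'),
                                encoding='utf-8', request=request, status=200)

    With this layout the spider would also need to fetch its detail pages through scrapy.Request rather than grabbing the middleware's driver directly with get_XIDIAN_driver().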

    Accepted by the asker as the best answer.

