ALGORITHM LOL 2022-08-09 17:43

Scrapy deployed on a server raises "ERROR: Error downloading" after running for a while

Problem description and background

I need to crawl a fairly large amount of data, so I deployed the crawler on an Alibaba Cloud server running Ubuntu. The project uses the Scrapy framework together with Selenium. As a test I ran the crawler on my local machine and on the server at the same time: the local run keeps going indefinitely, but the server run always aborts with the following error:

nohup: ignoring input
2022-08-09 16:58:47 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: Baidubaike_scrapy)
2022-08-09 16:58:47 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Linux-5.4.0-122-generic-x86_64-with-glibc2.29
2022-08-09 16:58:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Baidubaike_scrapy',
 'CONCURRENT_REQUESTS': 30,
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 0.5,
 'FEED_EXPORT_ENCODING': 'utf-8-sig',
 'LOG_FILE': 'Baike.log',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'Baidubaike_scrapy.spiders',
 'SPIDER_MODULES': ['Baidubaike_scrapy.spiders'],
 'USER_AGENT': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                '(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76Mozilla/5.0 (Windows NT 6.1; WOW64) '
                'AppleWebKit/537.1 (KHTML, like Gecko) Chrome/97.0.4692.99 '
                'Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 '
                '(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, '
                'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like '
                'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, '
                'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, '
                'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like '
                'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, '
                'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like '
                'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) '
                'AppleWebKit/536.3 (KHTML, like Gecko) Chrome/97.0.4692.99 '
                'Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like '
                'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, '
                'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like '
                'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, '
                'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like '
                'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like '
                'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
                'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, '
                'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76',
                'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 '
                '(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
                'Edg/97.0.1072.76']}
2022-08-09 16:58:47 [scrapy.extensions.telnet] INFO: Telnet Password: 64cb9b0424547488
2022-08-09 16:58:47 [py.warnings] WARNING: /usr/local/lib/python3.8/dist-packages/scrapy/extensions/feedexport.py:289: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
  exporter = cls(crawler)

2022-08-09 16:58:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-08-09 16:58:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'Baidubaike_scrapy.middlewares.SeleniumMiddlewares',
 'Baidubaike_scrapy.middlewares.RandomUserAgent',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-09 16:58:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-09 16:58:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-09 16:58:48 [scrapy.core.engine] INFO: Spider opened
2022-08-09 16:58:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-09 16:58:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-09 16:59:48 [scrapy.extensions.logstats] INFO: Crawled 96 pages (at 96 pages/min), scraped 95 items (at 95 items/min)
2022-08-09 17:00:48 [scrapy.extensions.logstats] INFO: Crawled 180 pages (at 84 pages/min), scraped 179 items (at 84 items/min)
2022-08-09 17:01:49 [scrapy.extensions.logstats] INFO: Crawled 286 pages (at 106 pages/min), scraped 285 items (at 106 items/min)
2022-08-09 17:02:48 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 71 pages/min), scraped 356 items (at 71 items/min)
2022-08-09 17:03:49 [scrapy.extensions.logstats] INFO: Crawled 437 pages (at 80 pages/min), scraped 436 items (at 80 items/min)
2022-08-09 17:04:28 [root] INFO: has error
2022-08-09 17:04:58 [scrapy.extensions.logstats] INFO: Crawled 450 pages (at 13 pages/min), scraped 450 items (at 14 items/min)
2022-08-09 17:04:58 [scrapy.core.scraper] ERROR: Error downloading <GET http://baike.baidu.com/view/518657.htm>
Traceback (most recent call last):
  File "/root/scrapy_file/Baidubaike_scrapy/middlewares.py", line 112, in process_request
    self.driver.get(request.url)
  File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 447, in get
    self.execute(Command.GET, {'url': url})
  File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 435, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 29.667
  (Session info: headless chrome=104.0.5112.79)
Stacktrace:
#0 0x56414047c403 <unknown>
#1 0x564140282778 <unknown>
#2 0x56414026fa88 <unknown>
#3 0x56414026e65b <unknown>
#4 0x56414026ec1c <unknown>
#5 0x56414027ac3f <unknown>
#6 0x56414027b7a2 <unknown>
#7 0x564140289dad <unknown>
#8 0x56414028dc6a <unknown>
#9 0x56414026f046 <unknown>
#10 0x564140289ab4 <unknown>
#11 0x5641402eb078 <unknown>
#12 0x5641402d78f3 <unknown>
#13 0x5641402ad0d8 <unknown>
#14 0x5641402ae205 <unknown>
#15 0x5641404c3e3d <unknown>
#16 0x5641404c6db6 <unknown>
#17 0x5641404ad13e <unknown>
#18 0x5641404c79b5 <unknown>
#19 0x5641404a1970 <unknown>
#20 0x5641404e4228 <unknown>
#21 0x5641404e43bf <unknown>
#22 0x5641404feabe <unknown>
#23 0x7f28b0ce5609 <unknown>


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/usr/local/lib/python3.8/dist-packages/scrapy/core/downloader/middleware.py", line 41, in process_request
    response = yield deferred_from_coro(method(request=request, spider=spider))
  File "/root/scrapy_file/Baidubaike_scrapy/middlewares.py", line 118, in process_request
    return HtmlResponse(url=self.driver.current_url,
  File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 529, in current_url
    return self.execute(Command.GET_CURRENT_URL)['value']
  File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 435, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 30.000
  (Session info: headless chrome=104.0.5112.79)
Stacktrace:
#0 0x56414047c403 <unknown>
#1 0x564140282778 <unknown>
#2 0x56414026fa88 <unknown>
#3 0x56414026e65b <unknown>
#4 0x56414026ec1c <unknown>
#5 0x56414027ac3f <unknown>
#6 0x56414027b7a2 <unknown>
#7 0x564140289dad <unknown>
#8 0x56414028dc6a <unknown>
#9 0x56414026f046 <unknown>
#10 0x564140289ab4 <unknown>
#11 0x5641402eab53 <unknown>
#12 0x5641402d78f3 <unknown>
#13 0x5641402ad0d8 <unknown>
#14 0x5641402ae205 <unknown>
#15 0x5641404c3e3d <unknown>
#16 0x5641404c6db6 <unknown>
#17 0x5641404ad13e <unknown>
#18 0x5641404c79b5 <unknown>
#19 0x5641404a1970 <unknown>
#20 0x5641404e4228 <unknown>
#21 0x5641404e43bf <unknown>
#22 0x5641404feabe <unknown>
#23 0x7f28b0ce5609 <unknown>

2022-08-09 17:04:58 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-09 17:04:58 [scrapy.extensions.feedexport] INFO: Stored csv feed (450 items) in: Baidubaike.csv
2022-08-09 17:04:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/selenium.common.exceptions.TimeoutException': 1,
 'downloader/response_bytes': 54927681,
 'downloader/response_count': 450,
 'downloader/response_status_count/200': 450,
 'elapsed_time_seconds': 370.020028,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 9, 9, 4, 58, 448223),
 'item_scraped_count': 450,
 'log_count/ERROR': 1,
 'log_count/INFO': 18,
 'log_count/WARNING': 1,
 'memusage/max': 94121984,
 'memusage/startup': 63336448,
 'request_depth_max': 450,
 'response_received_count': 450,
 'scheduler/dequeued': 451,
 'scheduler/dequeued/memory': 451,
 'scheduler/enqueued': 451,
 'scheduler/enqueued/memory': 451,
 'start_time': datetime.datetime(2022, 8, 9, 8, 58, 48, 428195)}
2022-08-09 17:04:58 [scrapy.core.engine] INFO: Spider closed (finished)

The crawl runs normally for a while before the problem above appears.

By contrast, the local run keeps going without interruption. In the downloader middleware I tried catching the exception and issuing a retry:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

from .settings import USER_AGENT

# Disabled: requests did not appear to get IP-banned, so this proxy
# middleware is commented out.
# class RandomProxy(object):
#     # Rotating-IP proxy support
#     # ======== fetch an IP from the Xun proxy provider ========
#     def __init__(self):
#         # The provider URL below returns proxy IPs
#         self.url = 
#         self.proxy = ""
#         # self.get_proxyip()
#
#     def get_proxyip(self):
#         """
#         Fetch a proxy by requesting self.url.
#         :return: "ip:port" string
#         """
#         resp = requests.get(self.url)
#         info = json.loads(resp.text)
#         proxys = info['RESULT'][0]
#         self.proxy = proxys['ip'] + ":" + proxys['port']
#         print("==========" + self.proxy + "==================")
#         return self.proxy
#
#     # If the response status is abnormal, switch proxy and retry
#     def process_response(self, request, response, spider):
#         """
#         :param request: the request
#         :param response: the returned page
#         :param spider: the spider
#         :return: a new request with a fresh proxy on failure,
#                  otherwise the response unchanged
#         """
#         # If the status is not 200, reissue the current request
#         if response.status != 200:
#             proxy = self.get_proxyip()
#             print("this is response ip: " + self.proxy)
#             # Attach the new proxy to the current request
#             request.meta['proxy'] = proxy
#             return request
#         return response


class RandomUserAgent(object):
    """Pick a random User-Agent from the pool in settings."""

    def __init__(self):
        self.user_agents = USER_AGENT

    def random_ua(self):
        return random.choice(self.user_agents)


class SeleniumMiddlewares(object):
    error_flag = 0

    def __init__(self):
        self.chrome_opt = webdriver.ChromeOptions()
        prefs = {
            # Skip image loading to speed up page loads
            "profile.managed_default_content_settings.images": 2,
            # NOTE: the original dict repeated this key, so only the second
            # value survived; merge the disabled plugins into one list.
            "plugins.plugins_disabled": ['Chrome PDF Viewer',
                                         'Adobe Flash Player'],
        }
        self.chrome_opt.add_argument('--no-sandbox')
        self.chrome_opt.add_argument('--disable-dev-shm-usage')
        self.chrome_opt.add_experimental_option("prefs", prefs)
        self.chrome_opt.add_argument('user-agent=' + RandomUserAgent().random_ua())
        self.chrome_opt.add_argument("--headless")

        # To make get() return immediately instead of waiting for the
        # page to finish loading:
        # desired_capabilities = DesiredCapabilities.CHROME
        # desired_capabilities["pageLoadStrategy"] = "none"

        # self.chrome_opt.add_argument("--proxy-server={}".format(RandomProxy().get_proxyip()))
        self.driver = webdriver.Chrome(
            executable_path=r'./chromedriver',
            chrome_options=self.chrome_opt)
    def process_request(self, request, spider):
        self.driver.set_page_load_timeout(30)
        try:
            self.driver.get(request.url)
        except (TimeoutException, WebDriverException):
            # Do NOT read self.driver.current_url here: after a renderer
            # timeout that call times out as well, and the second exception
            # escapes this handler and kills the download (this is exactly
            # the nested traceback in the log above). Use request.url.
            self.error_flag += 1
            if self.error_flag < 6:
                return HtmlResponse(url=request.url,
                                    body="",
                                    status=400,
                                    encoding='utf-8')
            else:
                self.error_flag = 0
                return HtmlResponse(url=request.url,
                                    body="",
                                    status=404,
                                    encoding='utf-8')
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            status=200,
            encoding='utf-8',
        )

After redeploying this version to the server, the same failure still occurs. Is this because I am not using proxy IPs, or because the server is losing its network connection? If I want the Scrapy crawl to keep running even when errors occur, how should I handle this? Thanks!

Here is the relevant part of the parse method:

    def parse(self, response):
        # On a timeout (synthetic 400), request the same page again
        if response.status == 400:
            yield Request(self.new_url, dont_filter=True, callback=self.parse)
        # After five timeouts in a row (synthetic 404), move on to the next page
        elif response.status == 404:
            self.page_index += 1
            self.new_url = self.url + str(self.page_index) + ".htm"
            if self.page_index < 100000:
                yield Request(self.new_url, dont_filter=True)
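One caveat with the code above: `error_flag` is a single class-level counter shared by every URL, so failures from different requests get mixed into one count. Tracking the attempt count per request in `request.meta` keeps each URL's retries separate. This is a sketch of that alternative; the meta key `selenium_retries` and the helper name `classify_failure` are made up for illustration, and the cap of 5 mirrors the "five attempts" logic in `parse()`:

```python
MAX_SELENIUM_RETRIES = 5  # matches the five-attempt limit used in parse()


def classify_failure(meta: dict) -> int:
    """Return the synthetic status for a timed-out request:
    400 = retry the same URL, 404 = give up and move to the next page."""
    retries = meta.get("selenium_retries", 0) + 1
    meta["selenium_retries"] = retries  # persist the count on the request
    return 400 if retries < MAX_SELENIUM_RETRIES else 404
```

In the middleware's `except` branch this becomes `status = classify_failure(request.meta)`; when `parse()` re-yields a 400 URL, it would copy `response.request.meta` onto the new `Request` so the counter carries over.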

1 answer

  • 快乐小土狗 2022-08-09 18:05

    The server losing its network connection? Unlikely — your cloud provider's console has network monitoring you can check, so it's probably not a connectivity problem.
    More likely you are being hit by anti-crawling measures: you are pulling a lot of data and sending many requests in a short window, which usually triggers anti-bot defenses, and you haven't set up a proxy.
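    If the block is indeed IP-based, attaching a rotating proxy to each request is the usual countermeasure. A minimal sketch, assuming you already have a pool of proxy endpoints (the addresses and the `RandomProxyMiddleware` name below are placeholders, not a real provider):

```python
import random

# Placeholder pool; replace with endpoints from your proxy provider.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RandomProxyMiddleware:
    """Downloader middleware that attaches a random proxy to each request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXY_POOL)
```

    Note that `request.meta['proxy']` only affects Scrapy's own downloader; for the Selenium-driven requests the proxy has to be passed to Chrome via `--proxy-server=...` when the driver is created.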

    This answer was accepted by the asker as the best answer.

Question timeline

  • Closed by the system on Aug 18
  • Answer accepted on Aug 10
  • Question created on Aug 9
