Observed problem and background
I need to crawl a fairly large amount of related data, so I deployed the crawler on an Alibaba Cloud server running Ubuntu. The crawler uses the Scrapy framework combined with Selenium. While crawling, I ran the spider on my local machine and on the server at the same time: the local program keeps running without problems, but the server program always fails with the following error:
nohup: ignoring input
2022-08-09 16:58:47 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: Baidubaike_scrapy)
2022-08-09 16:58:47 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Linux-5.4.0-122-generic-x86_64-with-glibc2.29
2022-08-09 16:58:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Baidubaike_scrapy',
'CONCURRENT_REQUESTS': 30,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 0.5,
'FEED_EXPORT_ENCODING': 'utf-8-sig',
'LOG_FILE': 'Baike.log',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'Baidubaike_scrapy.spiders',
'SPIDER_MODULES': ['Baidubaike_scrapy.spiders'],
'USER_AGENT': ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76Mozilla/5.0 (Windows NT 6.1; WOW64) '
'AppleWebKit/537.1 (KHTML, like Gecko) Chrome/97.0.4692.99 '
'Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 '
'(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like '
'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like '
'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like '
'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) '
'AppleWebKit/536.3 (KHTML, like Gecko) Chrome/97.0.4692.99 '
'Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like '
'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like '
'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like '
'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like '
'Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, '
'like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76',
'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 '
'(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 '
'Edg/97.0.1072.76']}
2022-08-09 16:58:47 [scrapy.extensions.telnet] INFO: Telnet Password: 64cb9b0424547488
2022-08-09 16:58:47 [py.warnings] WARNING: /usr/local/lib/python3.8/dist-packages/scrapy/extensions/feedexport.py:289: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
2022-08-09 16:58:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-08-09 16:58:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'Baidubaike_scrapy.middlewares.SeleniumMiddlewares',
'Baidubaike_scrapy.middlewares.RandomUserAgent',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-09 16:58:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-09 16:58:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-09 16:58:48 [scrapy.core.engine] INFO: Spider opened
2022-08-09 16:58:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-09 16:58:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-09 16:59:48 [scrapy.extensions.logstats] INFO: Crawled 96 pages (at 96 pages/min), scraped 95 items (at 95 items/min)
2022-08-09 17:00:48 [scrapy.extensions.logstats] INFO: Crawled 180 pages (at 84 pages/min), scraped 179 items (at 84 items/min)
2022-08-09 17:01:49 [scrapy.extensions.logstats] INFO: Crawled 286 pages (at 106 pages/min), scraped 285 items (at 106 items/min)
2022-08-09 17:02:48 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 71 pages/min), scraped 356 items (at 71 items/min)
2022-08-09 17:03:49 [scrapy.extensions.logstats] INFO: Crawled 437 pages (at 80 pages/min), scraped 436 items (at 80 items/min)
2022-08-09 17:04:28 [root] INFO: has error
2022-08-09 17:04:58 [scrapy.extensions.logstats] INFO: Crawled 450 pages (at 13 pages/min), scraped 450 items (at 14 items/min)
2022-08-09 17:04:58 [scrapy.core.scraper] ERROR: Error downloading <GET http://baike.baidu.com/view/518657.htm>
Traceback (most recent call last):
File "/root/scrapy_file/Baidubaike_scrapy/middlewares.py", line 112, in process_request
self.driver.get(request.url)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 447, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 435, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 29.667
(Session info: headless chrome=104.0.5112.79)
Stacktrace:
#0 0x56414047c403 <unknown>
#1 0x564140282778 <unknown>
#2 0x56414026fa88 <unknown>
#3 0x56414026e65b <unknown>
#4 0x56414026ec1c <unknown>
#5 0x56414027ac3f <unknown>
#6 0x56414027b7a2 <unknown>
#7 0x564140289dad <unknown>
#8 0x56414028dc6a <unknown>
#9 0x56414026f046 <unknown>
#10 0x564140289ab4 <unknown>
#11 0x5641402eb078 <unknown>
#12 0x5641402d78f3 <unknown>
#13 0x5641402ad0d8 <unknown>
#14 0x5641402ae205 <unknown>
#15 0x5641404c3e3d <unknown>
#16 0x5641404c6db6 <unknown>
#17 0x5641404ad13e <unknown>
#18 0x5641404c79b5 <unknown>
#19 0x5641404a1970 <unknown>
#20 0x5641404e4228 <unknown>
#21 0x5641404e43bf <unknown>
#22 0x5641404feabe <unknown>
#23 0x7f28b0ce5609 <unknown>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
result = current_context.run(gen.send, result)
File "/usr/local/lib/python3.8/dist-packages/scrapy/core/downloader/middleware.py", line 41, in process_request
response = yield deferred_from_coro(method(request=request, spider=spider))
File "/root/scrapy_file/Baidubaike_scrapy/middlewares.py", line 118, in process_request
return HtmlResponse(url=self.driver.current_url,
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 529, in current_url
return self.execute(Command.GET_CURRENT_URL)['value']
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 435, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 30.000
(Session info: headless chrome=104.0.5112.79)
Stacktrace:
#0 0x56414047c403 <unknown>
#1 0x564140282778 <unknown>
#2 0x56414026fa88 <unknown>
#3 0x56414026e65b <unknown>
#4 0x56414026ec1c <unknown>
#5 0x56414027ac3f <unknown>
#6 0x56414027b7a2 <unknown>
#7 0x564140289dad <unknown>
#8 0x56414028dc6a <unknown>
#9 0x56414026f046 <unknown>
#10 0x564140289ab4 <unknown>
#11 0x5641402eab53 <unknown>
#12 0x5641402d78f3 <unknown>
#13 0x5641402ad0d8 <unknown>
#14 0x5641402ae205 <unknown>
#15 0x5641404c3e3d <unknown>
#16 0x5641404c6db6 <unknown>
#17 0x5641404ad13e <unknown>
#18 0x5641404c79b5 <unknown>
#19 0x5641404a1970 <unknown>
#20 0x5641404e4228 <unknown>
#21 0x5641404e43bf <unknown>
#22 0x5641404feabe <unknown>
#23 0x7f28b0ce5609 <unknown>
2022-08-09 17:04:58 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-09 17:04:58 [scrapy.extensions.feedexport] INFO: Stored csv feed (450 items) in: Baidubaike.csv
2022-08-09 17:04:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/selenium.common.exceptions.TimeoutException': 1,
'downloader/response_bytes': 54927681,
'downloader/response_count': 450,
'downloader/response_status_count/200': 450,
'elapsed_time_seconds': 370.020028,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 9, 9, 4, 58, 448223),
'item_scraped_count': 450,
'log_count/ERROR': 1,
'log_count/INFO': 18,
'log_count/WARNING': 1,
'memusage/max': 94121984,
'memusage/startup': 63336448,
'request_depth_max': 450,
'response_received_count': 450,
'scheduler/dequeued': 451,
'scheduler/dequeued/memory': 451,
'scheduler/enqueued': 451,
'scheduler/enqueued/memory': 451,
'start_time': datetime.datetime(2022, 8, 9, 8, 58, 48, 428195)}
2022-08-09 17:04:58 [scrapy.core.engine] INFO: Spider closed (finished)
The spider crawled normally for a while before the problem above appeared.
By contrast, the local program keeps running continuously. In the downloader middleware I tried to catch the exception and issue the request again:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import json
import random
import time
import requests
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from .settings import USER_AGENT
#
# Commented out after finding that the requests do not get the IP banned
# class RandomProxy(object):
#     # Rotating IP proxy
#     # ============================== fetch an IP from the 迅代理 proxy service ==============================
#     def __init__(self):
#         # The url below is the IP endpoint provided by 讯代理
#         self.url =
#         self.proxy = ""
#         # self.get_proxyip()
#
#     def get_proxyip(self):
#         """
#         Fetch a proxy by requesting self.url
#         :return: "ip:port" as a str
#         """
#         resp = requests.get(self.url)
#         info = json.loads(resp.text)
#         # print(resp.text)
#         # print(info)
#         proxys = info['RESULT'][0]
#         print(type(proxys))
#         self.proxy = proxys['ip'] + ":" + proxys['port']
#         print("==========" + self.proxy + " +=================")
#         return self.proxy
#
#     # If the response status is abnormal, switch to a new proxy IP and retry
#     def process_response(self, request, response, spider):
#         """
#         :param request: the request
#         :param response: the returned page source
#         :param spider: the spider
#         :return: if the crawl runs into problems, the request re-issued with a new proxy; otherwise the response object
#         """
#         '''Handle the returned response'''
#         # If the response status is not 200, re-issue the current request object
#         if response.status != 200:
#             proxy = self.get_proxyip()
#             print("this is response ip:" + self.proxy)
#             # Attach the proxy to the current request
#             request.meta['proxy'] = proxy
#             return request
#         return response
class RandomUserAgent(object):
    """
    Pick a random User-Agent for requests
    """

    def __init__(self):
        self.user_agents = USER_AGENT

    def random_ua(self):
        return random.choice(self.user_agents)

#
class SeleniumMiddlewares(object):
    error_flag = 0

    def __init__(self):
        self.chrome_opt = webdriver.ChromeOptions()
        # Disable image loading and the Chrome PDF Viewer / Adobe Flash Player plugins
        prefs = {
            "profile.managed_default_content_settings.images": 2,
            "plugins.plugins_disabled": ['Chrome PDF Viewer', 'Adobe Flash Player'],
        }
        self.chrome_opt.add_argument('--no-sandbox')
        self.chrome_opt.add_argument('--disable-dev-shm-usage')
        self.chrome_opt.add_experimental_option("prefs", prefs)
        self.chrome_opt.add_argument('user-agent=' + RandomUserAgent().random_ua())
        self.chrome_opt.add_argument("--headless")
        # Make driver.get() return immediately instead of waiting for the full page load
        # desired_capabilities = DesiredCapabilities.CHROME
        # desired_capabilities["pageLoadStrategy"] = "none"
        # self.chrome_opt.add_argument("--proxy-server={}".format(RandomProxy().get_proxyip()))
        self.driver = webdriver.Chrome(
            executable_path=r'./chromedriver',
            chrome_options=self.chrome_opt)

    def process_request(self, request, spider):
        self.driver.set_page_load_timeout(30)
        try:
            self.driver.get(request.url)
        except Exception:
            # print("timeout")
            self.error_flag += 1
            if self.error_flag < 6:
                # Up to five consecutive timeouts: return a 400 so parse() retries the URL
                return HtmlResponse(url=self.driver.current_url,
                                    body="",
                                    status=400,
                                    encoding='utf-8')
            else:
                # Sixth timeout in a row: reset the counter and return a 404 so parse() skips the URL
                self.error_flag = 0
                return HtmlResponse(url=self.driver.current_url,
                                    body="",
                                    status=404,
                                    encoding='utf-8')
        response = HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            status=200,
            encoding='utf-8',
        )
        self.driver.implicitly_wait(1)
        return response
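
Reading the traceback again, the second TimeoutException appears to be raised inside my own except branch, because self.driver.current_url still has to talk to the renderer that is hanging. Below is a minimal sketch of a hardened process_request I am considering (my own untested assumption, not something I have deployed yet): it builds the fallback response from request.url and restarts headless Chrome after repeated timeouts, so the handler itself can no longer time out.

# Sketch only -- __init__ stays the same as above; just the request handling differs
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from scrapy.http import HtmlResponse


class SeleniumMiddlewares(object):
    error_flag = 0

    def process_request(self, request, spider):
        self.driver.set_page_load_timeout(30)
        try:
            self.driver.get(request.url)
        except (TimeoutException, WebDriverException):
            self.error_flag += 1
            # Build the fallback response from request.url: asking the driver for
            # current_url here goes back to the hung renderer and raises the
            # second TimeoutException that appears in the log above.
            if self.error_flag < 6:
                return HtmlResponse(url=request.url, body="", status=400,
                                    encoding='utf-8')
            # Too many timeouts in a row: restart headless Chrome, then let
            # parse() skip this URL via the 404 status.
            self.error_flag = 0
            try:
                self.driver.quit()
            except WebDriverException:
                pass
            self.driver = webdriver.Chrome(executable_path=r'./chromedriver',
                                           chrome_options=self.chrome_opt)
            return HtmlResponse(url=request.url, body="", status=404,
                                encoding='utf-8')
        return HtmlResponse(url=self.driver.current_url,
                            body=self.driver.page_source,
                            status=200,
                            encoding='utf-8')

The intent is that the except branch never calls back into the possibly hung driver, so a single renderer timeout cannot take down the whole request chain. Whether this is enough on the server is exactly what I am unsure about.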
Redeploying the catch-and-retry version shown earlier still produced the same problem on the server. Is this because I am not using IP proxies, or because the server is dropping its network connection? If I want Scrapy to keep running even when this error occurs, how should I handle it? Thanks!
Below is the relevant part of parse():
def parse(self, response):
    # If the request timed out, request this page again
    if response.status == 400:
        yield Request(self.new_url, dont_filter=True, callback=self.parse)
    # If it has timed out five times in a row, move on to the next page
    elif response.status == 404:
        self.page_index += 1
        self.new_url = self.url + str(self.page_index) + ".htm"
        if self.page_index < 100000:
            yield Request(self.new_url, dont_filter=True)
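
One more detail I am unsure about: Scrapy's HttpErrorMiddleware filters out non-2xx responses before they reach parse(), so the synthetic 400/404 responses returned by the middleware would never trigger the retry logic above unless they are explicitly allowed. A minimal sketch of what I believe is needed, assuming it is not already configured somewhere else (the spider name and start URL below are placeholders):

from scrapy import Spider


class BaikeSpider(Spider):
    name = "baike"                                        # placeholder
    start_urls = ["http://baike.baidu.com/view/1.htm"]    # placeholder
    # Allow 400/404 through to parse(); the same can be done globally
    # with HTTPERROR_ALLOWED_CODES in settings.py
    handle_httpstatus_list = [400, 404]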