Problem symptoms and background
When crawling Tencent recruitment data with the Scrapy framework, the spider keeps erroring out: the crawled data is reported as inconsistent, and there is also a SQL syntax error.
Relevant code (please do not paste screenshots)
Here is the relevant code:
import scrapy
import json
import math
import urllib.parse
from ..items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # start_urls = ['http://careers.tencent.com/']
    # Note: input() at class level runs as soon as the module is imported.
    job = input("Enter the job title to search for: ")
    # URL-encode the keyword so it is safe to embed in the query string
    encode_job = urllib.parse.quote(job)
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1639587637815&country" \
              "Id=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}" \
              "&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1639587574036&postId={}" \
              "&language=zh-cn"
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # The response body is a JSON string
        json_dic = json.loads(response.text)
        job_counts = int(json_dic['Data']['Count'])
        print(job_counts)
        total_pages = math.ceil(job_counts / 10)
        for page in range(1, total_pages + 1):
            one_url = self.one_url.format(self.encode_job, page)
            # dont_filter=True: the page-1 URL duplicates the start URL and
            # would otherwise be dropped by Scrapy's duplicate filter
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids,
                                 dont_filter=True)

    # Was misspelled "pares_post_ids", so callback=self.parse_post_ids
    # could not find the method
    def parse_post_ids(self, response):
        post = json.loads(response.text)['Data']['Post']
        for p in post:
            post_id = p['PostId']
            two_url = self.two_url.format(post_id)
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    def parse_job(self, response):
        item = TencentItem()
        # json.loads (for strings), not json.load (for file objects)
        job = json.loads(response.text)['Data']
        item['name'] = job['RecruitPostName']
        item['location'] = job['LocationName']
        item['kind'] = job['CategoryName']
        item['duty'] = job['Responsibility']
        item['requ'] = job['Requirement']
        item['release_time'] = job['LastUpdateTime']
        yield item
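For reference, the keyword-encoding and paging arithmetic in parse can be checked outside Scrapy. A minimal sketch (the count of 137 is an assumed example value, not real API output):

```python
import math
import urllib.parse

# URL-encode the search keyword, as the spider does with urllib.parse.quote.
# A non-ASCII keyword shows why the encoding step is needed:
keyword = "后台开发"
encoded = urllib.parse.quote(keyword)
print(encoded)  # percent-encoded UTF-8, safe to place in a URL

# Paging: Data.Count jobs at pageSize=10 per page, rounded up.
job_counts = 137  # assumed example value
total_pages = math.ceil(job_counts / 10)
print(total_pages)  # 14
```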
Run results and error output
Below is the error encountered when running:
C:\Users\y5263\AppData\Local\Programs\Python\Python37\python.exe C:/Users/y5263/Desktop/Tencent/Tencent/run.py
Enter the job title to search for: java
2021-12-24 00:16:21 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: Tencent)
2021-12-24 00:16:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 36.0.0, Platform Windows-10-10.0.19041-SP0
2021-12-24 00:16:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-12-24 00:16:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Tencent',
'CONCURRENT_REQUESTS': 1,
'DOWNLOAD_DELAY': 1,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'Tencent.spiders',
'SPIDER_MODULES': ['Tencent.spiders']}
2021-12-24 00:16:21 [scrapy.extensions.telnet] INFO: Telnet Password: cc872de8bc9eeaf2
2021-12-24 00:16:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled item pipelines:
['Tencent.pipelines.TencentMySQLPipeline',
'Tencent.pipelines.TencentMongoDBPipeline',
'Tencent.pipelines.TencentPipeline']
2021-12-24 00:16:22 [scrapy.core.engine] INFO: Spider opened
Spider starting
Exiting spider
2021-12-24 00:16:52 [scrapy.core.engine] INFO: Closing spider (shutdown)
2021-12-24 00:16:52 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method CoreStats.spider_closed of <scrapy.extensions.corestats.CoreStats object at 0x0000023DA16E8DC8>>
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\utils\defer.py", line 157, in maybeDeferred_coro
    result = f(*args, **kw)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\extensions\corestats.py", line 31, in spider_closed
    elapsed_time = finish_time - self.start_time
TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'NoneType'
2021-12-24 00:16:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'log_count/ERROR': 1, 'log_count/INFO': 8}
2021-12-24 00:16:52 [scrapy.core.engine] INFO: Spider closed (shutdown)
Unhandled error in Deferred:
2021-12-24 00:16:52 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1909, in unwindGenerator
    return _cancellableInlineCallbacks(gen)  # type: ignore[unreachable]
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1816, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status)
--- <exception caught here> ---
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")
2021-12-24 00:16:52 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")
Process finished with exit code 1
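For what it's worth, the 1064 error in the traceback already quotes the offending SQL fragment: 'truncat table tencent'. MySQL's keyword is TRUNCATE, so the statement run by TencentMySQLPipeline (its code is not shown above; the pipeline method in the comment below is a hypothetical sketch) most likely just contains a typo:

```python
# The statement the log implies the pipeline runs, and the corrected form.
BROKEN_SQL = "truncat table tencent"   # triggers MySQL error 1064
FIXED_SQL = "TRUNCATE TABLE tencent"   # valid MySQL syntax

# In the (hypothetical) TencentMySQLPipeline, the fix would land in
# whichever hook clears the table before a crawl, e.g.:
#
#     def open_spider(self, spider):
#         self.conn = pymysql.connect(...)  # connection details omitted
#         self.cursor = self.conn.cursor()
#         self.cursor.execute("TRUNCATE TABLE tencent")  # not "truncat"

print(FIXED_SQL)
```

The second TypeError ('datetime.datetime' and 'NoneType') is only a consequence: the pipeline raised during open_spider, so the crawl never started and CoreStats had no start_time when it shut down.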
My ideas and the approaches I have tried
I hope someone can help me figure this out.