Problem symptoms and background
When crawling Tencent recruitment data with the Scrapy framework, the spider keeps erroring out: the crawled data is reported as inconsistent, and there is also a SQL syntax error.
Relevant code (please do not paste screenshots)
Here is the relevant code:
import scrapy
import json
import math
import urllib.parse
from ..items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # start_urls = ['http://careers.tencent.com/']
    # Note: input() at class level runs as soon as the module is imported.
    job = input("Enter the job title to search for: ")
    # URL-encode the keyword so it is safe to embed in the query string
    encode_job = urllib.parse.quote(job)
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1639587637815&country" \
              "Id=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}" \
              "&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1639587574036&postId={}" \
              "&language=zh-cn"
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # The response body is a JSON string
        json_dic = json.loads(response.text)
        job_counts = int(json_dic['Data']['Count'])
        print(job_counts)
        total_pages = math.ceil(job_counts / 10)
        for page in range(1, total_pages + 1):
            one_url = self.one_url.format(self.encode_job, page)
            # dont_filter=True: the page-1 URL duplicates the start URL and
            # would otherwise be dropped by Scrapy's duplicate filter
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids,
                                 dont_filter=True)

    # Was misspelled "pares_post_ids", so callback=self.parse_post_ids
    # could not find the method
    def parse_post_ids(self, response):
        post = json.loads(response.text)['Data']['Post']
        for p in post:
            post_id = p['PostId']
            two_url = self.two_url.format(post_id)
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    def parse_job(self, response):
        item = TencentItem()
        # json.loads (for strings), not json.load (for file objects)
        job = json.loads(response.text)['Data']
        item['name'] = job['RecruitPostName']
        item['location'] = job['LocationName']
        item['kind'] = job['CategoryName']
        item['duty'] = job['Responsibility']
        item['requ'] = job['Requirement']
        item['release_time'] = job['LastUpdateTime']
        yield item
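For reference, the keyword-encoding and paging arithmetic in parse can be checked outside Scrapy. A minimal sketch (the count of 137 is an assumed example value, not real API output):

```python
import math
import urllib.parse

# URL-encode the search keyword, as the spider does with urllib.parse.quote.
# A non-ASCII keyword shows why the encoding step is needed:
keyword = "后台开发"
encoded = urllib.parse.quote(keyword)
print(encoded)  # percent-encoded UTF-8, safe to place in a URL

# Paging: Data.Count jobs at pageSize=10 per page, rounded up.
job_counts = 137  # assumed example value
total_pages = math.ceil(job_counts / 10)
print(total_pages)  # 14
```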
Run results and error output
Below is the error encountered when running:
C:\Users\y5263\AppData\Local\Programs\Python\Python37\python.exe C:/Users/y5263/Desktop/Tencent/Tencent/run.py
Enter the job title to search for: java
2021-12-24 00:16:21 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: Tencent)
2021-12-24 00:16:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 36.0.0, Platform Windows-10-10.0.19041-SP0
2021-12-24 00:16:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-12-24 00:16:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Tencent',
'CONCURRENT_REQUESTS': 1,
'DOWNLOAD_DELAY': 1,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'Tencent.spiders',
'SPIDER_MODULES': ['Tencent.spiders']}
2021-12-24 00:16:21 [scrapy.extensions.telnet] INFO: Telnet Password: cc872de8bc9eeaf2
2021-12-24 00:16:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled item pipelines:
['Tencent.pipelines.TencentMySQLPipeline',
'Tencent.pipelines.TencentMongoDBPipeline',
'Tencent.pipelines.TencentPipeline']
2021-12-24 00:16:22 [scrapy.core.engine] INFO: Spider opened
Spider starting
Exiting spider
2021-12-24 00:16:52 [scrapy.core.engine] INFO: Closing spider (shutdown)
2021-12-24 00:16:52 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method CoreStats.spider_closed of <scrapy.extensions.corestats.CoreStats object at 0x0000023DA16E8DC8>>
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\utils\defer.py", line 157, in maybeDeferred_coro
    result = f(*args, **kw)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\extensions\corestats.py", line 31, in spider_closed
    elapsed_time = finish_time - self.start_time
TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'NoneType'
2021-12-24 00:16:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'log_count/ERROR': 1, 'log_count/INFO': 8}
2021-12-24 00:16:52 [scrapy.core.engine] INFO: Spider closed (shutdown)
Unhandled error in Deferred:
2021-12-24 00:16:52 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1909, in unwindGenerator
    return _cancellableInlineCallbacks(gen)  # type: ignore[unreachable]
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1816, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status)
--- <exception caught here> ---
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")
2021-12-24 00:16:52 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")
Process finished with exit code 1
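For what it's worth, the 1064 error in the traceback already quotes the offending SQL fragment: 'truncat table tencent'. MySQL's keyword is TRUNCATE, so the statement run by TencentMySQLPipeline (its code is not shown above; the pipeline method in the comment below is a hypothetical sketch) most likely just contains a typo:

```python
# The statement the log implies the pipeline runs, and the corrected form.
BROKEN_SQL = "truncat table tencent"   # triggers MySQL error 1064
FIXED_SQL = "TRUNCATE TABLE tencent"   # valid MySQL syntax

# In the (hypothetical) TencentMySQLPipeline, the fix would land in
# whichever hook clears the table before a crawl, e.g.:
#
#     def open_spider(self, spider):
#         self.conn = pymysql.connect(...)  # connection details omitted
#         self.cursor = self.conn.cursor()
#         self.cursor.execute("TRUNCATE TABLE tencent")  # not "truncat"

print(FIXED_SQL)
```

The second TypeError ('datetime.datetime' and 'NoneType') is only a consequence: the pipeline raised during open_spider, so the crawl never started and CoreStats had no start_time when it shut down.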
My ideas and the approaches I have tried
I hope someone can help me figure this out.