北搭 · 2021-12-24 00:23 · acceptance rate: 0% · 15 views

Scraping data from the Tencent Careers site with the Scrapy framework

The problem and its background

When I use the Scrapy framework to crawl Tencent Careers job data, the run keeps failing: the errors say the scraped data is inconsistent, and that the SQL syntax is wrong.

Relevant code (please don't paste screenshots)

Here is the relevant code:

import scrapy
import json
import math
import urllib.parse
from ..items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # start_urls = ['http://careers.tencent.com/']
    job = input("请输入你要搜索的工作岗位:")
    # URL-encode the search keyword before interpolating it into the request URL
    encode_job = urllib.parse.quote(job)

    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1639587637815&country" \
              "Id=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}" \
              "&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1639587574036&postId={}" \
              "&language=zh-cn"
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # The response body is a JSON string
        json_dic = json.loads(response.text)
        job_counts = int(json_dic['Data']['Count'])
        print(job_counts)
        total_pages = math.ceil(job_counts/10)
        for page in range(1, total_pages+1):
            one_url = self.one_url.format(self.encode_job, page)
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids)

    def parse_post_ids(self, response):
        post = json.loads(response.text)['Data']['Post']
        for p in post:
            post_id = p['PostId']
            two_url = self.two_url.format(post_id)
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    def parse_job(self, response):
        item = TencentItem()
        job = json.loads(response.text)['Data']
        item['name'] = job['RecruitPostName']
        item['location'] = job['LocationName']
        item['kind'] = job['CategoryName']
        item['duty'] = job['Responsibility']
        item['requ'] = job['Requirement']
        item['release_time'] = job['LastUpdateTime']
        yield item
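
For reference, items.py isn't shown in the question; a minimal sketch matching the fields this spider populates might look like the following (the field names are taken from parse_job above, everything else is an assumption):

import scrapy


class TencentItem(scrapy.Item):
    # Field names assumed from TencentSpider.parse_job
    name = scrapy.Field()          # RecruitPostName
    location = scrapy.Field()      # LocationName
    kind = scrapy.Field()          # CategoryName
    duty = scrapy.Field()          # Responsibility
    requ = scrapy.Field()          # Requirement
    release_time = scrapy.Field()  # LastUpdateTime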

Run output and error messages

Here is the error output from the run:

C:\Users\y5263\AppData\Local\Programs\Python\Python37\python.exe C:/Users/y5263/Desktop/Tencent/Tencent/run.py
请输入你要搜索的工作岗位:java
2021-12-24 00:16:21 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: Tencent)
2021-12-24 00:16:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 36.0.0, Platform Windows-10-10.0.19041-SP0
2021-12-24 00:16:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-12-24 00:16:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Tencent',
 'CONCURRENT_REQUESTS': 1,
 'DOWNLOAD_DELAY': 1,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'Tencent.spiders',
 'SPIDER_MODULES': ['Tencent.spiders']}
2021-12-24 00:16:21 [scrapy.extensions.telnet] INFO: Telnet Password: cc872de8bc9eeaf2
2021-12-24 00:16:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-24 00:16:22 [scrapy.middleware] INFO: Enabled item pipelines:
['Tencent.pipelines.TencentMySQLPipeline',
 'Tencent.pipelines.TencentMongoDBPipeline',
 'Tencent.pipelines.TencentPipeline']
2021-12-24 00:16:22 [scrapy.core.engine] INFO: Spider opened
爬虫开始执行
退出爬虫
2021-12-24 00:16:52 [scrapy.core.engine] INFO: Closing spider (shutdown)
2021-12-24 00:16:52 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method CoreStats.spider_closed of <scrapy.extensions.corestats.CoreStats object at 0x0000023DA16E8DC8>>
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\utils\defer.py", line 157, in maybeDeferred_coro
    result = f(*args, **kw)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\extensions\corestats.py", line 31, in spider_closed
    elapsed_time = finish_time - self.start_time
TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'NoneType'
2021-12-24 00:16:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'log_count/ERROR': 1, 'log_count/INFO': 8}
2021-12-24 00:16:52 [scrapy.core.engine] INFO: Spider closed (shutdown)
Unhandled error in Deferred:
2021-12-24 00:16:52 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1909, in unwindGenerator
    return _cancellableInlineCallbacks(gen)  # type: ignore[unreachable]
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1816, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status)
--- <exception caught here> ---
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")

2021-12-24 00:16:52 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\y5263\AppData\Local\Programs\Python\Python37\lib\site-packages\scrapy\crawler.py", line 89, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")

Process finished with exit code 1

My approach and what I've tried

I'm hoping someone can help me figure this out.

The result I want to achieve

1 answer

  • DarkAthena (Oracle application and database design consultant), answered 2021-12-24 15:34
    pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'truncat table tencent' at line 1")
    

    The error above is telling you that your SQL statement has a typo: it should be truncate table tencent, not truncat table tencent.
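
    The pipeline code isn't posted, but the log shows 'Tencent.pipelines.TencentMySQLPipeline' failing while the spider is being opened, so the statement is presumably run in open_spider. A minimal sketch of the corrected pipeline, assuming placeholder connection settings and assuming the "爬虫开始执行"/"退出爬虫" lines in the log come from this class:

    import pymysql

    class TencentMySQLPipeline:
        def open_spider(self, spider):
            print("爬虫开始执行")
            # Placeholder connection settings -- adjust to your environment
            self.conn = pymysql.connect(host="localhost", user="root",
                                        password="your_password",
                                        database="tencent_db", charset="utf8mb4")
            self.cursor = self.conn.cursor()
            # The 1064 error came from "truncat table tencent"; the keyword is TRUNCATE
            self.cursor.execute("truncate table tencent")

        def close_spider(self, spider):
            print("退出爬虫")
            self.cursor.close()
            self.conn.close()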

