weixin_33675507 · 2015-07-05 06:50 · Acceptance rate: 0%
78 views

Scrapy Scraper Question

I am trying to use Scrapy to scrape www.paytm.com. The website uses AJAX requests, in the form of XHR, to display search results.

I managed to track down the XHR request, and the AJAX response is similar to JSON, but it isn't actually JSON.

This is the link for one of the XHR requests: https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6 . Looking at the URL, the parameter page_count controls which page of results is returned, and the parameter userQuery carries the search query passed to the website.
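
For illustration, requests of this form can be generated in a Scrapy spider roughly like this (the spider below is only an example skeleton built around the URL pattern above, not my actual code):

    import scrapy

    class PaytmSearchSpider(scrapy.Spider):
        # Example skeleton: walks the paginated search results by varying
        # page_count, with userQuery/q carrying the search term ("tv").
        name = "paytm_search_example"
        base_url = ("https://search.paytm.com/search/?page_count={page}"
                    "&userQuery=tv&items_per_page=30&resolution=960x720"
                    "&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6")

        def start_requests(self):
            for page in range(1, 4):  # first few result pages
                yield scrapy.Request(self.base_url.format(page=page),
                                     callback=self.parse)

        def parse(self, response):
            pass  # JSON handling goes here (see the snippet further down)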

Now, if you look at the response, it isn't actually JSON; it only looks similar to JSON (I verified this on http://jsonlint.com/). I want to scrape this using Scrapy, and only Scrapy, because as a framework it would be faster than libraries such as BeautifulSoup; building a scraper that runs at such a high speed with them would take a lot of effort. That is the only reason I want to use Scrapy.

Now, this is the snippet of code I used to extract the JSON response from the URL:

    jsonresponse = json.loads(response.body_as_unicode())
    print json.dumps(jsonresponse, indent=4, sort_keys=True)

On executing the code, it throws an error:

2015-07-05 12:13:23 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-05 12:13:23 [scrapy] INFO: Optional features available: ssl, http11
2015-07-05 12:13:23 [scrapy] INFO: Overridden settings: {'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'CONCURRENT_REQUESTS': 100}
2015-07-05 12:13:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-05 12:13:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-05 12:13:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-05 12:13:23 [scrapy] INFO: Enabled item pipelines: 
2015-07-05 12:13:23 [scrapy] INFO: Spider opened
2015-07-05 12:13:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-05 12:13:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-05 12:13:24 [scrapy] DEBUG: Crawled (200) <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
2015-07-05 12:13:24 [scrapy] ERROR: Spider error processing <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "Startup App/SCRAPERS/paytmscraper_scrapy/paytmspiderscript.py", line 111, in parse
    jsonresponse = json.loads(response.body_as_unicode())
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2015-07-05 12:13:24 [scrapy] INFO: Closing spider (finished)
2015-07-05 12:13:24 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 343,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6483,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 5, 6, 43, 24, 733187),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2015, 7, 5, 6, 43, 23, 908135)}
2015-07-05 12:13:24 [scrapy] INFO: Spider closed (finished)

Now, my question: how do I scrape such a response using Scrapy? If any other code is required, feel free to ask in the comments; I shall willingly provide it.

Please provide the complete code related to this; it would be much appreciated. Some manipulation of the JSON response in Python (similar to string comparison) would also work for me, if it can help me scrape this.

P.S.: I can't modify the JSON response by hand every time, because it is whatever the website returns. So please suggest a programmatic (Pythonic) way to do this. Preferably, I want to use Scrapy as my framework.


3 answers

  • weixin_33676492 2015-07-05 09:59

    If you look at the not-quite-JSON result, it is clear that it contains JSON.

    If you remove the initial typeof angular.callbacks._6 === "function" && angular.callbacks._6( part and the trailing ); from the response, you get valid JSON, which you can validate with JSONLint.
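
    A minimal sketch of that approach inside the spider's parse callback (assuming the wrapper always looks exactly like the prefix and suffix above; the callback name must match the callback= parameter sent in the URL):

        import json

        def parse(self, response):
            # Strip the JSONP wrapper so only the JSON payload remains.
            prefix = 'typeof angular.callbacks._6 === "function" && angular.callbacks._6('
            suffix = ');'
            body = response.body_as_unicode().strip()
            if body.startswith(prefix) and body.endswith(suffix):
                body = body[len(prefix):-len(suffix)]
            jsonresponse = json.loads(body)
            print(json.dumps(jsonresponse, indent=4, sort_keys=True))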

    Ultimately, the solution is to find the first occurrence of { and the last occurrence of } in the response, extract the text between them (including those curly brackets), and pass that to json.loads instead of the whole result.
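
    A more general sketch, which does not depend on the exact wrapper text; it simply slices out the outermost {...} object and decodes that:

        import json

        def parse(self, response):
            body = response.body_as_unicode()
            start = body.index('{')       # first opening brace
            end = body.rindex('}') + 1    # last closing brace, inclusive
            jsonresponse = json.loads(body[start:end])
            print(json.dumps(jsonresponse, indent=4, sort_keys=True))

    Either way, jsonresponse is then an ordinary Python dict that the spider can walk to build its items.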

