weixin_33675507 · 2015-07-05 06:50 · Acceptance rate: 0%
78 views

Scrapy Scraper Question

I am trying to use Scrapy to scrape www.paytm.com. The website uses AJAX requests, in the form of XHR, to display search results.

I managed to track down the XHR request, and the AJAX response is similar to JSON, but it isn't actually JSON.

This is the link for one of the XHR requests: https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6 . Looking at the URL, the parameter page_count controls which page of results is returned, and the parameter userQuery carries the search query passed to the website.
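
For illustration, requests of this form can be generated in a Scrapy spider roughly like this (the spider below is only an example skeleton built around the URL pattern above, not my actual code):

    import scrapy

    class PaytmSearchSpider(scrapy.Spider):
        # Example skeleton: walks the paginated search results by varying
        # page_count, with userQuery/q carrying the search term ("tv").
        name = "paytm_search_example"
        base_url = ("https://search.paytm.com/search/?page_count={page}"
                    "&userQuery=tv&items_per_page=30&resolution=960x720"
                    "&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6")

        def start_requests(self):
            for page in range(1, 4):  # first few result pages
                yield scrapy.Request(self.base_url.format(page=page),
                                     callback=self.parse)

        def parse(self, response):
            pass  # JSON handling goes here (see the snippet further down)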

Now, if you look at the response, it isn't actually JSON; it only looks similar to JSON (I verified this on http://jsonlint.com/). I want to scrape this using Scrapy, and only Scrapy, because as a framework it would be faster than libraries such as BeautifulSoup; building a scraper that runs at such a high speed with them would take a lot of effort. That is the only reason I want to use Scrapy.

Now, this is the snippet of code I used to extract the JSON response from the URL:

    jsonresponse = json.loads(response.body_as_unicode())
    print json.dumps(jsonresponse, indent=4, sort_keys=True)

On executing the code, it throws an error:

2015-07-05 12:13:23 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-05 12:13:23 [scrapy] INFO: Optional features available: ssl, http11
2015-07-05 12:13:23 [scrapy] INFO: Overridden settings: {'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'CONCURRENT_REQUESTS': 100}
2015-07-05 12:13:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-05 12:13:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-05 12:13:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-05 12:13:23 [scrapy] INFO: Enabled item pipelines: 
2015-07-05 12:13:23 [scrapy] INFO: Spider opened
2015-07-05 12:13:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-05 12:13:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-05 12:13:24 [scrapy] DEBUG: Crawled (200) <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
2015-07-05 12:13:24 [scrapy] ERROR: Spider error processing <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "Startup App/SCRAPERS/paytmscraper_scrapy/paytmspiderscript.py", line 111, in parse
    jsonresponse = json.loads(response.body_as_unicode())
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2015-07-05 12:13:24 [scrapy] INFO: Closing spider (finished)
2015-07-05 12:13:24 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 343,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6483,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 5, 6, 43, 24, 733187),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2015, 7, 5, 6, 43, 23, 908135)}
2015-07-05 12:13:24 [scrapy] INFO: Spider closed (finished)

Now, my question: how do I scrape such a response using Scrapy? If any other code is required, feel free to ask in the comments; I shall willingly provide it.

Please provide the complete code related to this; it would be much appreciated. Some manipulation of the JSON response in Python (similar to string comparison) would also work for me, if it can help me scrape this.

P.S.: I can't modify the JSON response by hand every time, because it is whatever the website returns. So please suggest a programmatic (Pythonic) way to do this. Preferably, I want to use Scrapy as my framework.


3 answers

  • weixin_33676492 2015-07-05 09:59

    If you look at the not-quite-JSON result, it is clear that it contains JSON.

    If you remove the initial typeof angular.callbacks._6 === "function" && angular.callbacks._6( part and the trailing ); from the response, you get valid JSON, which you can validate with JSONLint.
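
    A minimal sketch of that approach inside the spider's parse callback (assuming the wrapper always looks exactly like the prefix and suffix above; the callback name must match the callback= parameter sent in the URL):

        import json

        def parse(self, response):
            # Strip the JSONP wrapper so only the JSON payload remains.
            prefix = 'typeof angular.callbacks._6 === "function" && angular.callbacks._6('
            suffix = ');'
            body = response.body_as_unicode().strip()
            if body.startswith(prefix) and body.endswith(suffix):
                body = body[len(prefix):-len(suffix)]
            jsonresponse = json.loads(body)
            print(json.dumps(jsonresponse, indent=4, sort_keys=True))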

    Ultimately, the solution is to find the first occurrence of { and the last occurrence of } in the response, extract the text between them (including those curly brackets), and pass that to json.loads instead of the whole result.
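
    A more general sketch, which does not depend on the exact wrapper text; it simply slices out the outermost {...} object and decodes that:

        import json

        def parse(self, response):
            body = response.body_as_unicode()
            start = body.index('{')       # first opening brace
            end = body.rindex('}') + 1    # last closing brace, inclusive
            jsonresponse = json.loads(body[start:end])
            print(json.dumps(jsonresponse, indent=4, sort_keys=True))

    Either way, jsonresponse is then an ordinary Python dict that the spider can walk to build its items.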

