I am trying to use Scrapy to scrape www.paytm.com. The website uses AJAX requests, in the form of XHR, to display search results.
I managed to track down the XHR, and the AJAX response is similar to JSON, but it isn't actually JSON.
This is the link for one of the XHR requests: https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6 . If you look at the URL, the parameter page_count is responsible for showing different pages of results, and the parameter userQuery carries the search query that is passed to the website.
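For reference, this is roughly how I generate the URL for any query/page combination (a small sketch; the parameter names are taken straight from the XHR above, and I'm using Python 3's urllib.parse here, which on Python 2 would be urllib.urlencode):

```python
from urllib.parse import urlencode

BASE = "https://search.paytm.com/search/"

def build_search_url(query, page=1, per_page=30):
    """Build the search XHR URL for a given query and results page."""
    params = {
        "userQuery": query,       # the search term typed by the user
        "q": query,
        "page_count": page,       # which page of results to fetch
        "items_per_page": per_page,
        "callback": "angular.callbacks._6",
    }
    return BASE + "?" + urlencode(params)
```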
Now, if you look at the response, it isn't actually JSON; it only looks similar to JSON (I verified this on http://jsonlint.com/). I want to scrape this using Scrapy, and Scrapy only: since it is a framework, it would be faster than other libraries like BeautifulSoup, where building a scraper that runs at such a high speed would take a lot of effort. That is the only reason I want to use Scrapy.
Now, this is the snippet of code that I use to extract the JSON response from the URL:
jsonresponse = json.loads(response.body_as_unicode())
print json.dumps(jsonresponse, indent=4, sort_keys=True)
On executing the code, it throws an error:
2015-07-05 12:13:23 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-05 12:13:23 [scrapy] INFO: Optional features available: ssl, http11
2015-07-05 12:13:23 [scrapy] INFO: Overridden settings: {'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'CONCURRENT_REQUESTS': 100}
2015-07-05 12:13:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-05 12:13:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-05 12:13:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-05 12:13:23 [scrapy] INFO: Enabled item pipelines:
2015-07-05 12:13:23 [scrapy] INFO: Spider opened
2015-07-05 12:13:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-05 12:13:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-05 12:13:24 [scrapy] DEBUG: Crawled (200) <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
2015-07-05 12:13:24 [scrapy] ERROR: Spider error processing <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "Startup App/SCRAPERS/paytmscraper_scrapy/paytmspiderscript.py", line 111, in parse
jsonresponse = json.loads(response.body_as_unicode())
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2015-07-05 12:13:24 [scrapy] INFO: Closing spider (finished)
2015-07-05 12:13:24 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 343,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 6483,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 5, 6, 43, 24, 733187),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2015, 7, 5, 6, 43, 23, 908135)}
2015-07-05 12:13:24 [scrapy] INFO: Spider closed (finished)
Now, my question: how do I scrape such a response using Scrapy? If any other code is required, feel free to ask in the comments. I shall willingly provide it!
Please provide the entire code related to this; it would be much appreciated! Maybe some manipulation of the JSON response from Python (similar to string comparison) would also work for me, if it can help me scrape this.
P.S.: I can't modify the JSON response manually (by hand) every time, because this is the response given by the website. So please suggest a programmatic (Pythonic) way to do this. Preferably, I want to use Scrapy as my framework.
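One guess, in case it helps frame an answer: the URL carries callback=angular.callbacks._6, so I suspect the body is JSONP, i.e. valid JSON wrapped in a JavaScript function call, which is why json.loads rejects it. This is a sketch of the kind of string manipulation I have in mind (I have not tested it against the live site; the sample payload at the bottom is made up by me):

```python
import json
import re

def parse_jsonp(body):
    """Strip a JSONP wrapper like callback_name({...}); and parse the JSON inside.

    If the body is not wrapped (plain JSON), it is parsed as-is.
    """
    # Match "someCallback( ... )" optionally followed by a semicolon;
    # plain JSON starts with '{' or '[', so it will not match this pattern.
    match = re.search(r'^\s*[\w.$]+\((.*)\)\s*;?\s*$', body, re.DOTALL)
    if match:
        body = match.group(1)
    return json.loads(body)

# Hypothetical sample shaped like the wrapped response (invented for illustration):
sample = 'angular.callbacks._6({"grid_layout": [{"name": "Some TV"}]});'
data = parse_jsonp(sample)
```

In a Scrapy parse callback this would presumably replace the bare json.loads(response.body_as_unicode()) line that currently raises ValueError.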