weixin_33743880 2015-07-09 12:45

Firstcry.com scraper problem

I am trying to scrape www.firstcry.com. The website uses AJAX (in the form of XHR requests) to load its search results.

In my code below, the jsonresponse variable holds the JSON output of the website. When I print it, it contains many \ (backslashes).

Just below the jsonresponse variable, I have commented out several lines. Those were my attempts (made after reading several similar questions here on Stack Overflow) to remove all the backslashes, as well as the u' prefixes that also appeared.

After all those tries, I am still unable to remove ALL the backslashes and u' prefixes.

If I don't remove all of them, I cannot access jsonresponse by its keys, so it is essential for me to remove ALL of them.

Please help me resolve this issue. I would prefer code specific to my case, rather than a general answer.

My code is here:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import json , simplejson , ujson

#query=raw_input("Enter a product to search for= ")
query='bag'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.firstcry.com"]


    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.firstcry.com/svcs/search.svc/GetSearchPagingProducts_new?PageNo=" + str(i) + "&PageSize=20&SortExpression=Relevance&SubCatId=&BrandId=&Price=&OUTOFSTOCK=&DISCOUNT=&Q=" + query1 + "&rating="
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        return [ Request(url = start_url) for start_url in start_urls ]


    def parse(self, response):
        print response

        items = []
        jsonresponse = dict(ujson.loads(response.body_as_unicode()))
#       jsonresponse = jsonresponse.replace("\\","")
#       jsonresponse = jsonresponse.decode('string_escape')
#       jsonresponse = ("%r" % json.loads(response.body_as_unicode()))
#       d= jsonresponse.json()
        #jsonresponse = jsonresponse.strip("/")
#       print jsonresponse
#       print d
#       print json.dumps("%r" % jsonresponse, indent=4, sort_keys=True)
#       a = simplejson.dumps(simplejson.loads(response.body_as_unicode()).replace("u\'","\'"), indent=4, sort_keys=True)
        #a= json.dumps(json.JSONDecoder().decode(jsonresponse))
        #a = ujson.dumps((ujson.loads(response.body_as_unicode())) , indent=4 )
        a=json.dumps(jsonresponse, indent=4)
        a=a.decode('string_escape')
        a=(a.decode('string_escape'))
#       a.gsub('\\', '')
        #a = a.strip('/')
        #print (jsonresponse)
        print a
        #print "%r" % a
#       print "%r" % json.loads(response.body_as_unicode())

        p=(jsonresponse["hits"])["hit"]
#       print p
#       raw_input()
        for x in p:
            item = DmozItem()
            item['productname'] = str(x['title'])
            item['product_link'] = "http://www.yepme.com/Deals1.aspx?CampId="+str(x["uniqueId"])
            item['current_price']='Rs. ' + str(x["price"])

            try:            
                p=x["marketprice"]
                item['mrp'] = 'Rs. ' + str(p)

            except:
                item['mrp'] = item['current_price']

            try:            
                item['offer'] = str(x["promotionalMsg"])
            except:
                item['offer'] = str('No additional offer available')

            item['imageurl'] = "http://staticaky.yepme.com/newcampaign/"+str(x["uniqueId"])[:-1]+"/"+str(x["smallimage"])
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)

spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("CONCURRENT_REQUESTS" , 100)
#)
settings.set( "DEPTH_PRIORITY" , 1)
settings.set("SCHEDULER_DISK_QUEUE" , "scrapy.squeues.PickleFifoDiskQueue")
settings.set( "SCHEDULER_MEMORY_QUEUE" , "scrapy.squeues.FifoMemoryQueue")
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

1 answer

  • weixin_33691817 2015-07-09 13:12

    No need to get all complicated. Instead of using ujson and response.body_as_unicode() and then casting that into a dict, just use regular json and response.body:

$ scrapy shell "http://www.firstcry.com/svcs/search.svc/GetSearchPagingProducts_new?PageNo=1&PageSize=20&SortExpression=Relevance&SubCatId=&BrandId=&Price=&OUTOFSTOCK=&DISCOUNT=&Q=bag&rating="
    ...
    >>> jsonresponse = json.loads(response.body)
    >>> jsonresponse.keys()
    [u'ProductResponse']
    

    This worked just fine for me with your example. Looks like you're a bit deep into the "hacking around for an answer" mode ;)

    I'll note that this line...

    p=(jsonresponse["hits"])["hit"]
    

    ... won't work in your code. The only key available in jsonresponse after parsing the JSON is "ProductResponse". That key contains another JSON-encoded string, which you have to decode a second time, like this:

    >>> product_response = json.loads(jsonresponse['ProductResponse'])
    >>> product_response['hits']['hit']
    [{u'fields': {u'_score': u'56.258633',
        u'bname': u'My Milestones',
        u'brandid': u'450',
    ...
    

    I think that will give you what you were looking to get in your p variable.
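    Putting the two decode steps together, the double unwrapping could be factored into a small helper along these lines (a minimal sketch; `extract_hits` is a hypothetical name, and the structure follows the `ProductResponse` / `hits` / `hit` layout shown above):

```python
import json

def extract_hits(body):
    # The API wraps its payload twice: the outer JSON object has a single
    # "ProductResponse" key whose value is itself a JSON-encoded string --
    # the stray backslashes in the question come from printing that inner
    # string. Decode both layers, then drill down to the list of hits.
    outer = json.loads(body)
    inner = json.loads(outer["ProductResponse"])
    return inner["hits"]["hit"]
```

    Inside the spider, `parse()` would then reduce to looping over `extract_hits(response.body)` and building each `DmozItem` from a hit's fields.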

