Scrapy can't scrape the image URLs (5C bounty)

I can scrape other fields such as the product id and name just fine, but not the download URLs of the images on the page.
I'm not sure whether the problem is in my regular expressions.

Link to the site being scraped: https://www.ssense.com/en-cn/women?q=top

```python
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.http import Request
from ssense.items import SsenseItem


class SsensePicSpider(scrapy.Spider):
    name = 'ssense_pic'
    allowed_domains = ['ssense.com']
    start_urls = ['http://ssense.com/']

    def parse(self, response):
        """Generate the search-result listing pages."""
        search_word = 'top'  # search keyword, adjustable
        for i in range(1, 2):  # listing pages to crawl (only page 1 here)
            url = ('http://www.ssense.com/en-cn/women?q=' + str(search_word)
                   + '&page=' + str(i))
            yield Request(url=url, callback=self.page)

    def page(self, response):
        """Extract the product URLs from a listing page."""
        body = response.body.decode('utf-8', 'ignore')
        url_id = r'"url":\s"([/a-z-0-9]*)"'
        item_ids = re.compile(url_id).findall(body)  # product URL paths
        for this_id in item_ids:
            website = 'https://www.ssense.com/en-cn' + this_id  # product page
            yield Request(url=website, callback=self.next)

    def next(self, response):
        """Extract the fields of a single product."""
        item = SsenseItem()
        body = response.body.decode('utf-8', 'ignore')

        # product ID
        pro_id = r'"productID":\s(\d{7})'
        item['productID'] = re.compile(pro_id).findall(body)

        # product name
        item_name = r'"name":\s"([a-zA-Z -]*)"[,]'
        item['name'] = re.compile(item_name).findall(body)

        # product price
        item_price = r'"price":\s([0-9]*)'
        item['price'] = re.compile(item_price).findall(body)

        # SKU
        item_sku = r'"sku":\s"([0-9A-Z]*)",'
        item['sku'] = re.compile(item_sku).findall(body)

        # image URL: this is the pattern that never matches
        item_image = r'"image":\s"([a-z:/.0-9A-F_-]*)"'
        item['image'] = re.compile(item_image).findall(body)

        yield item
```
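One way to confirm whether the regex itself is the problem is to run the image pattern against a saved copy of a product page outside of Scrapy. A minimal sketch, assuming the page body has been saved to a local file (the file name here is hypothetical):

```python
import re

# The same pattern the spider uses for the image field
pattern = re.compile(r'"image":\s"([a-z:/.0-9A-F_-]*)"')

# Run it over a saved copy of a product page (hypothetical file name)
with open('product_page.html', encoding='utf-8') as f:
    body = f.read()

matches = pattern.findall(body)
print(len(matches), matches[:3])
# Zero matches means the page contains no `"image": "..."` field at all,
# so the URL has to come from the markup instead (see the answer below).
```

Note also that `\s` matches exactly one whitespace character, so the pattern silently fails on `"image":"..."` written without a space after the colon; `\s*` is more forgiving.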

1 Answer

The image URL should be https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_1086,w_724/c_scale,h_480/f_auto,dpr_1.0/201071F110010_1.jpg

It's the value after `data-srcset="`; I have no idea what your `image:` pattern is supposed to be matching.

```html
<picture data-v-60b7d3e3="">
  <source data-v-60b7d3e3=""
          data-srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_1086,w_724/c_scale,h_480/f_auto,dpr_1.0/201071F110010_1.jpg"
          media="(min-width: 1025px)"
          srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_1086,w_724/c_scale,h_480/f_auto,dpr_1.0/201071F110010_1.jpg">
  <source data-v-60b7d3e3=""
          data-srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_320/f_auto,dpr_1.0/201071F110010_1.jpg"
          media="(min-width: 768px)"
          srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_320/f_auto,dpr_1.0/201071F110010_1.jpg">
  <img data-v-60b7d3e3=""
       data-srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_280/f_auto,dpr_1.0/201071F110010_1.jpg"
       src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAXUAAAIwAQMAAABDTmnJAAAAA1BMVEUAAACnej3aAAAAAXRSTlMAQObYZgAAADFJREFUeNrtwTEBAAAAwiD7pzbDfmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANEBaQAAAZUbkzMAAAAASUVORK5CYII="
       alt="Live the Process - Grey Seamless Sport Top"
       class="product-thumbnail lazyloaded"
       srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_280/f_auto,dpr_1.0/201071F110010_1.jpg">
</picture>
```
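For reference, a minimal sketch of pulling those URLs out with XPath instead of a regex, using parsel (the selector library Scrapy uses internally). The selector assumes the `<picture>`/`<source>` structure quoted above:

```python
from parsel import Selector  # bundled with Scrapy; pip install parsel

# A trimmed-down sample of the markup quoted above
html = '''<picture>
  <source data-srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_1086,w_724/c_scale,h_480/f_auto,dpr_1.0/201071F110010_1.jpg"
          media="(min-width: 1025px)">
</picture>'''

sel = Selector(text=html)
# The real image URL lives in data-srcset; the <img> src is only a
# base64 placeholder until the lazy loader swaps the image in.
image_urls = sel.xpath('//picture//@data-srcset').getall()
print(image_urls)
```

Inside the spider's `next` callback the same expression would be `item['image'] = response.xpath('//picture//@data-srcset').getall()`, with no manual decoding of `response.body` needed.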
