I'm a complete beginner who has only just started with the Scrapy framework. I wrote a simple Spider, but every crawl returns fewer results than the site actually holds. For example, if the site lists 100 records, my crawl usually comes back missing anywhere from a few to a few dozen of them; it is rare to get all 100.
The spider has two parts: horizontal crawling across pages (following the next-page link) and vertical crawling into each product's detail page. I originally assumed the data was being lost when the pipeline wrote it out to Excel, but after stepping through with the debugger I found it already goes missing inside the Spider: the item count in the parse method is complete, and every yield Request is handed to the queue, but parse_item is never called for some of the product detail pages. What could cause this? Is Scrapy's asynchronous downloading affected by network conditions, i.e. an inherent flaw, or is it a problem with my design? Is there a fix? Otherwise, once the dataset gets large, the loss will be serious.
Any help would be appreciated, thank you all.
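From what I have read so far, two usual suspects are Scrapy's duplicate-request filter silently dropping URLs it has already seen, and download errors or timeouts that fail without leaving any trace in the output. Below is a minimal diagnostic sketch under that assumption; DropCheckSpider, on_error, and the XPath are illustrative stand-ins, while errback and dont_filter are standard Request arguments:

import scrapy


class DropCheckSpider(scrapy.Spider):
    # hypothetical spider whose only job is to reveal dropped requests
    name = "drop_check"
    start_urls = ["https://www.e-shenhua.com/ec/auction/oilAuctionList.jsp"]

    def parse(self, response):
        # illustrative XPath; reuse the real one from the spider below
        for href in response.xpath('//table//td/a/@href').extract():
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_item,
                errback=self.on_error,   # fires on download errors and timeouts
                dont_filter=True,        # bypasses the duplicate-URL filter
            )

    def parse_item(self, response):
        self.logger.info('reached detail page: %s', response.url)

    def on_error(self, failure):
        # failure.request is the Request that never reached its callback
        self.logger.error('request failed: %s -> %r', failure.request.url, failure.value)

With dont_filter=True the dupefilter is ruled out entirely, and any request that still never reaches parse_item should show up in on_error together with its URL.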
import scrapy
from scrapy import Request, Selector, Spider

# assuming the Item lives in the project's items module; adjust to your layout
from cyfirst.items import CyfirstItem


class MyFirstSpider(Spider):
    name = "MyFirstSpider"
    allowed_domains = ["e-shenhua.com"]  # was misspelled "allowed_doamins", so it was silently ignored
    start_urls = ["https://www.e-shenhua.com/ec/auction/oilAuctionList.jsp?_DARGS=/ec/auction/oilAuctionList.jsp"]
    url = 'https://www.e-shenhua.com/ec/auction/oilAuctionList.jsp'
    def parse(self, response):
        items = []
        selector = Selector(response)
        contents = selector.xpath('//table[@class="table expandable table-striped"]/tbody/tr')
        urldomain = 'https://www.e-shenhua.com'
        for content in contents:
            item = CyfirstItem()
            productId = content.xpath('td/a/text()').extract()[0].strip()
            productUrl = content.xpath('td/a/@href').extract()[0]
            totalUrl = urldomain + productUrl
            productName = content.xpath('td/a/text()').extract()[1].strip()
            deliveryArea = content.xpath('td/text()').extract()[-5].strip()
            saleUnit = content.xpath('td/text()').extract()[-4]
            item['productId'] = productId
            item['totalUrl'] = totalUrl
            item['productName'] = productName
            item['deliveryArea'] = deliveryArea
            item['saleUnit'] = saleUnit
            items.append(item)
        print(len(items))  # debug: this count always matches the page
        # enter each product's detail page
        for item in items:
            yield Request(item['totalUrl'], meta={'item': item}, callback=self.parse_item)
            # print(item['productId'])
        # jump to the next page
        nowpage = selector.xpath('//div[@class="pagination pagination-small"]/ul/li[@class="active"]/a/text()').extract()[0]
        nextpage = int(nowpage) + 1
        str_nextpage = str(nextpage)  # presumably used in the omitted formdata below
        nextLink = selector.xpath('//div[@class="pagination pagination-small"]/ul/li[last()]/a/@onclick').extract()
        if nextLink:
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    # *************** (form fields omitted in the original post)
                },
                callback=self.parse
            )
    # scrape the contents of each product's detail page
    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        # print(item['productId'])
        productInfo = sel.xpath('//div[@id="content-products-info"]/table/tbody/tr')
        titalBidQty = ''.join(productInfo.xpath('td[3]/text()').extract()).strip()
        titalBidUnit = ''.join(productInfo.xpath('td[3]/span/text()').extract())
        titalBid = titalBidQty + " " + titalBidUnit
        minBuyQty = ''.join(productInfo.xpath('td[4]/text()').extract()).strip()
        minBuyUnit = ''.join(productInfo.xpath('td[4]/span/text()').extract())
        minBuy = minBuyQty + " " + minBuyUnit
        # the table layout varies: some pages carry an extra column
        isminVarUnit = ''.join(sel.xpath('//div[@id="content-products-info"]/table/thead/tr/th[5]/text()').extract())
        if isminVarUnit == '最小变量单位':  # header text meaning "minimum variation unit"
            minVarUnitsl = ''.join(productInfo.xpath('td[5]/text()').extract()).strip()
            minVarUnitdw = ''.join(productInfo.xpath('td[5]/span/text()').extract())
            minVarUnit = minVarUnitsl + " " + minVarUnitdw
            startPrice = ''.join(productInfo.xpath('td[6]/text()').extract()).strip().rstrip('/')
            minAddUnit = ''.join(productInfo.xpath('td[7]/text()').extract()).strip()
        else:
            minVarUnit = ''
            startPrice = ''.join(productInfo.xpath('td[5]/text()').extract()).strip().rstrip('/')
            minAddUnit = ''.join(productInfo.xpath('td[6]/text()').extract()).strip()
        item['titalBid'] = titalBid
        item['minBuyQty'] = minBuy
        item['minVarUnit'] = minVarUnit
        item['startPrice'] = startPrice
        item['minAddUnit'] = minAddUnit
        # print(item)
        return item
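Independently of the spider code, the log can be made to show exactly what is being dropped. A sketch of the relevant settings.py entries, with values that are only illustrative:

# settings.py (excerpt) - illustrative values, tune for the target site
DUPEFILTER_DEBUG = True      # log every request dropped as a duplicate
RETRY_ENABLED = True         # retry requests that fail with network errors
RETRY_TIMES = 3              # retries per request on top of the first attempt
DOWNLOAD_TIMEOUT = 30        # seconds before a request counts as failed
CONCURRENT_REQUESTS = 8      # lower concurrency if the site drops connections
DOWNLOAD_DELAY = 0.5         # be gentler on the server; reduces timeouts
LOG_LEVEL = 'INFO'           # DEBUG additionally shows each scheduled request

The stats Scrapy prints at the end of the crawl (keys such as dupefilter/filtered and retry/count) then show how many requests were filtered or retried, which should point at where the missing detail pages went.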