April_Leon · 2019-11-20 09:29 · acceptance rate: 0%
532 views
Question closed

Scraping the Google Play store with Scrapy

I'm using the Scrapy framework to crawl the Google Play store, but it stopped after fewer than 10,000 apps. Can anyone explain why? It shouldn't be because I got banned, since I set up a user-agent (UA) pool and rotating proxy IPs.
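For context, the UA pool is wired in as a downloader middleware along these lines (a simplified sketch, not my exact middleware; USER_AGENT_LIST stands in for my actual list of UA strings in settings.py):

import random

class RandomUserAgentMiddleware(object):
    """Pick a random User-Agent for every outgoing request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting holding the UA strings.
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Returning None lets Scrapy continue processing the request normally.
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)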
The spider code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from gp.items import GpItem


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/']

    def parse(self, response):
        # Front page: collect the category links from the navigation menu.
        urls = response.xpath(
            '//div[@class="LNKfBf"]/ul/li[@class="CRHL7b eZdLre"]'
            '/ul[@class="TEOqAc"]/li[@class="KZnDLd"]/a[@class="r2Osbf"]/@href'
        ).extract()

        for link in urls:
            # urljoin resolves the relative href against https://play.google.com.
            yield Request(url=response.urljoin(link), callback=self.parse_more,
                          dont_filter=True)

    def parse_more(self, response):
        # Category page: follow the "See more" links to the app lists.
        urls = response.xpath(
            '//a[@class="LkLjZd ScJHi U8Ww7d xjAeve nMZKrb  id-track-click "]/@href'
        ).extract()

        for link in urls:
            yield Request(url=response.urljoin(link), callback=self.parse_next,
                          dont_filter=True)

    def parse_next(self, response):
        # App list page: extract the links to individual app detail pages.
        app_urls = response.xpath(
            '//div[@class="Vpfmgd"]/div[@class="RZEgze"]/div[@class="vU6FJ p63iDd"]'
            '/a[@class="JC71ub"]/@href'
        ).extract()

        for url in app_urls:
            yield Request(url=response.urljoin(url), callback=self.parse_app,
                          dont_filter=True)

    def parse_app(self, response):
        # Detail page: scrape the app fields into a GpItem.
        item = GpItem()
        item['app_url'] = response.url
        item['app_name'] = response.xpath('//h1[@itemprop="name"]/span/text()').get()
        item['app_icon'] = response.xpath('//img[@itemprop="image"]/@src').get()
        item['app_rate'] = response.xpath('//div[@class="K9wGie"]/div[@class="BHMmbe"]/text()').get()
        item['app_version'] = response.xpath('//div[@class="IQ1z0d"]/span[@class="htlgb"]/text()').get()
        item['app_description'] = response.xpath('//div[@itemprop="description"]/span/div/text()').get()
        yield item
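
For reference, the GpItem imported from gp.items declares one scrapy.Field per attribute filled in parse_app, roughly like this:

import scrapy

class GpItem(scrapy.Item):
    app_url = scrapy.Field()
    app_name = scrapy.Field()
    app_icon = scrapy.Field()
    app_rate = scrapy.Field()
    app_version = scrapy.Field()
    app_description = scrapy.Field()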

Another question: can I crawl only apps of a specific type by defining search keywords? If so, how would I implement that in Scrapy? My rough idea is sketched below, but I'm not sure it's the right approach. Any help would be much appreciated!
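My guess (assuming Google Play's public search endpoint https://play.google.com/store/search?q=<keyword>&c=apps, which I have not verified for scraping) is to override start_requests and issue one search request per keyword:

import scrapy

class GoogleSearchSpider(scrapy.Spider):
    # Hypothetical sketch: crawl only the apps matching the given keywords.
    name = 'google_search'
    allowed_domains = ['play.google.com']
    keywords = ['fitness', 'puzzle']  # example keywords

    def start_requests(self):
        for kw in self.keywords:
            # c=apps restricts the search results to applications.
            url = 'https://play.google.com/store/search?q=%s&c=apps' % kw
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Follow every link that points at an app detail page.
        for href in response.xpath('//a[contains(@href, "/store/apps/details")]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_app)

    def parse_app(self, response):
        # ...same field extraction as GoogleSpider.parse_app above...
        pass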


1 answer

  • threenewbee 2019-11-20 10:24

    Try a different IP. Also, access www.ip138.com through your proxy and see whether it reports your real IP; if it does, your proxy is not an anonymous proxy, and the server can still see your IP.
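    A quick way to run that check (the proxy address is a placeholder; any IP-echo page works):

    import requests

    # Fetch an IP-echo page through the proxy and inspect the reported address.
    proxies = {'http': 'http://YOUR_PROXY:PORT', 'https': 'http://YOUR_PROXY:PORT'}
    print(requests.get('http://www.ip138.com', proxies=proxies, timeout=10).text)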
    Also, try again after a while. If the crawl works again, it was anti-scraping after all. Google has multiple layers of anti-scraping defenses, e.g. analysis of user behavior patterns; if you keep requesting at one constant rate, you are easy to detect.
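    To avoid that constant request rate, you can let Scrapy randomize and auto-adjust its delays in settings.py, for example:

    # settings.py -- slow down and randomize the request rate
    DOWNLOAD_DELAY = 2               # base delay between requests, in seconds
    RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay varies between 0.5x and 1.5x of the base
    AUTOTHROTTLE_ENABLED = True      # adapt the delay to the server's response times
    AUTOTHROTTLE_START_DELAY = 5
    AUTOTHROTTLE_MAX_DELAY = 60
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0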

