weixin_33691700 2017-01-17 14:33

Why does my crawler's pagination fail?

This is my first question here. I'm building a web crawler that I want to use to scrape all the hotel links and names from invia.cz.

import scrapy


y = 0
class invia(scrapy.Spider):
    name = 'Kreta'
    start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

    def parse(self, response):

        for x in range(1, 9):
            yield {
                'titles': response.css("#main > div > div > div > div.col.col-content > div.product-list > div > ul > li:nth-child(%d)>div.head>h2>a>span.name::text" % x).extract(),
            }

        if response.css('#main > div > div > div > div.col.col-content > div.product-list > div > p > a.next').extract_first():
            y = y + 1
            go = ["https://dovolena.invia.cz/d_start_from=13.01.2017&sort=nl_sell&page=%d" % y]
            print(go)
            yield scrapy.Request(
                response.urljoin(go),
                callback=self.parse
            )

The site's pages are loaded via Ajax. I change the URL value manually, and the page number in the URL is only incremented when the Next button appears on the page. When I test whether the button appears, the condition works fine, but when I start the crawler it only scrapes the first page. This is my first crawler project, so it's probably still rough. Thanks in advance for any answers!

The error logs are linked here: Error Log1, Error Log


1 answer

  • weixin_33733810 2017-01-17 20:03

    Your use of the "global" y variable is not only peculiar, it won't work either: because parse assigns to y, Python treats y as local to the function, so y = y + 1 raises an UnboundLocalError instead of incrementing the module-level counter.
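
    A minimal sketch of that scoping rule in plain Python (hypothetical names, nothing Scrapy-specific):

    y = 0

    def parse():
        # Assigning to y anywhere in this function makes y local to it,
        # so the read on the right-hand side fails before the increment.
        y = y + 1

    parse()  # raises UnboundLocalError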

    You're using y to count how many times parse has been called. Ideally you don't want to touch anything outside the function's scope, and you can achieve the same thing with the request.meta attribute:

    from scrapy import Request  # import needed for Request below

    def parse(self, response):
        y = response.meta.get('index', 1)  # default is page 1
        y += 1
        # ...
        # next page
        url = 'http://example.com/?p={}'.format(y)
        yield Request(url, self.parse, meta={'index': y})
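
    Passing the counter through meta ties the state to each individual request; since Scrapy handles requests asynchronously and in no guaranteed order, a single shared counter could not reliably tell you which page a given response belongs to.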
    

    Regarding your pagination issue: your next-page CSS selector is incorrect, since the <a> node you're selecting doesn't have an absolute href attached to it. Fixing this also makes the whole y issue obsolete. To solve it, try:

    import re  # import needed for re.sub below

    def parse(self, response):
        next_page = response.css("a.next::attr(data-page)").extract_first()
        # replace the "page=1" part of the url with the next page number
        url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
        yield Request(url, self.parse)
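
    (This assumes the Next link exposes the page number in a data-page attribute, which the working spider below relies on. Also note that extract_first() returns None once there is no Next link, so the final version guards against that before building the URL.)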
    

    EDIT: Here's the whole working spider:

    import re

    import scrapy


    class InviaSpider(scrapy.Spider):
        name = 'invia'
        start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

        def parse(self, response):
            # each result's name lives in a span.name node
            names = response.css('span.name::text').extract()
            for name in names:
                yield {'name': name}

            # next page (the Next link stores the page number in data-page)
            next_page = response.css("a.next::attr(data-page)").extract_first()
            if next_page:  # extract_first() returns None on the last page
                url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
                yield scrapy.Request(url, self.parse)
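
    As a hypothetical usage note (assuming the file is saved as invia_spider.py), you can run this standalone, without a full Scrapy project, and write the results to JSON:

        scrapy runspider invia_spider.py -o hotels.json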
    
