python爬虫去哪网热门景点

我用python爬虫去哪网热门景点信息，结果只爬到了两页的内容，不知道是哪的问题，有大佬帮忙看看：

-- coding: utf-8 --

created by:tianxing

created date:2017-11-1

import scrapy
import re
import datetime
from practice.items import QvnaItem

class QuNaSpider(scrapy.Spider):
name = 'qvnawang'
#start_urls = ['http://sou.zhaopin.com/jobs/searchresult.ashx?pd=1&jl=%E9%80%89%E6%8B%A9%E5%9C%B0%E5%8C%BA&sm=0&sf=0&st=99999&isadv=1&sg=1545043c61dd44d5bf41f9913890abfa&p=1']
start_urls = ['http://piao.qunar.com/ticket/list.htm?keyword=%E7%83%AD%E9%97%A8%E6%99%AF%E7%82%B9&region=&from=mpl_search_suggest&subject=']
def parse(self,response):
item = QvnaItem()
#得到初始展示页面的基准xpath(某一页)
#pages = response.xpath('//div[@style="width: 224px;*width: 218px; _width:200px; float: left"]/a/@href')
pages = response.xpath('//div[@class="sight_item_pop"]/table/tr[3]/td/a/@href')

    #循环取出每一页上的每一个链接url地址，并调用parse_page函数解析每一个url上的页面内容
    for eachPage in pages:
        #获取链接URL（页面上所有的链接，每个链接单独处理）
        #singleUrl = eachPage.extract()
        singleUrl = 'http://piao.qunar.com'+eachPage.extract()
        #内部调用parse_page函数
        yield scrapy.Request(url = singleUrl,meta={'item':item},callback=self.parse_page)



    #取得除最后一页之外的 '下一页' 的xpath
    try:
        if response.xpath('//div[@class="pager"]/a/@class').extract()[0] == 'next':
            nextPage = 'http://piao.qunar.com' + response.xpath('//div[@class="pager"]/a/@href').extract()[0]
            # 递归调用，将下一页的URL传进Request函数
            yield scrapy.Request(url=nextPage, callback=self.parse)
    except IndexError as ie:
        # 因最后一页没有上述xpath，所以不满足条件，即可退出递归
        try:
            exit()
        except SystemExit as se:
            pass


#爬取单个链接对应的页面内容
def parse_page(self, response):
      # 通过meta得到item
      item = response.meta['item']


      tour_info = response.xpath('/html/body/div[2]/div[2]/div[@class="mp-description-detail"]')

      #景点名称
      try:
          item['name'] = tour_info.xpath('div[1]/span[1]/text()').extract()[0]\
          .replace('\r','').replace('\n','').replace('\t','').replace(' ','').replace('\xa0','').replace('\u3000','')
      except IndexError as ie:
          item['name'] = ''

      #景点等级
      try:
          item['rank'] = tour_info.xpath('div[1]/span[2]/text()').extract()[0]\
           .replace('\r','').replace('\n','').replace('\t','').replace(' ','').replace('\xa0','').replace('\u3000','')
      except IndexError as ie:
          item['rank'] = 0

      #景点描述
      try:
          item['decription'] = tour_info.xpath('div[2]/text()').extract()[0]\
          .replace('/',',').replace('\r','').replace('\n','').replace('\t','').replace(' ','').replace('\xa0','').replace('\u3000','')
      except IndexError as ie:
          item['decription'] = ''

      #景点地点
      try:
          item['address'] = tour_info.xpath('div[3]/span[3]/text()').extract()[0]
          item['address'] = item['address'].replace('/',',').replace(u'、','')\
               .replace(u'（',',').replace('(',',').replace(u'）','').replace(')','')\
               .replace('\r','').replace('\n','').replace('\t','').replace(' ','').replace('\xa0','').replace('\u3000','')
      except IndexError as ie:
          item['address'] = ''

      #用户评价
      try:
          item['comment'] = tour_info.xpath('div[4]/span[3]/span/text()').extract()[0]\
          .replace('/',',').replace('\r','').replace('\n','').replace('\t','').replace(' ','').replace('\xa0','').replace('\u3000','')
      except IndexError as ie:
          item['comment'] = ''

      #天气情况
      try:
          item['weather'] = tour_info.xpath('div[5]/span[3]/text()').extract()[0]\
          .replace('/',',').replace('\r','').replace('\n','').replace('\t','').replace(' ','').replace('\xa0','').replace('\u3000','')
      except IndexError as ie:
          item['weather'] = ''

      #门票最低价格
      try:
          item['lowprice'] = tour_info.xpath('div[7]/span/em/text()').extract()[0]\
          .replace('/',',').replace('\r','').replace('\n','').replace('\t','').replace(' ','').replace('\xa0','').replace('\u3000','')
      except IndexError as ie:
          item['lowprice'] = ''

      #发布日期
      today = datetime.datetime.now()
      item['date'] = today.strftime('%Y-%m-%d')



      yield item

写回答
好问题 0 提建议
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

1条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
threenewbee 2018-06-22 15:51
关注
用fiddler抓包看下，要么是第三页的地址或者参数没有对，要么是服务器有反爬虫的机制（比如频繁访问，返回错误页面、验证码）。

本回答被题主选为最佳回答 , 对您是否有帮助呢?

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

报告相同问题？

关注问题

python爬取去哪网全国景区数据
2020-02-26 18:34

python爬取去哪网全国景区数据，爬取地址为piao.qunar.com,注意去哪网有反爬虫策略,如果ip被封，可能使用手机热点
使用python爬虫分析去哪网的景点数据
2020-08-28 11:09

swingfer的博客爬取景点的名称，热度和门票价格，并将数据存储在scenery.csv文件中 import requests from bs4 import BeautifulSoup # 正则表达式 import re import csv # 用来存储数据的csv f = open('scenery.csv', 'w', ...
python爬虫数据可视化分析大作业.zip
2020-06-12 15:39

在本项目中，"python爬虫数据可视化分析大作业.zip" 是一个综合性的学习资源，主要涉及了Python编程中的两个重要领域：网络爬虫（Web Scraping）和数据可视化（Data Visualization）。通过这个作业，我们可以深入...
python爬虫20个案例
2018-03-25 07:34

讲诉python爬虫的20个案例。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。
Python爬虫详解（一看就懂）
2022-06-21 22:07

练习时长两年半的Programmer的博客爬虫简单的来说就是用程序获取网络上数据这个过程的一种名称。如果要获取网络上数据，我们要给爬虫一个网址（程序中通常叫URL），爬虫发送一个HTTP请求给目标网页的服务器，服务器返回数据给客户端（也就是我们的...
Python爬虫系列（一）——手把手教你写Python爬虫
2021-10-23 15:47

纸照片的博客这一部分我写在另一篇文章了，Python爬虫批量下载百度图片–点击跳转 5. 后言爬虫还是比较好入门的，这得益于成熟的爬虫工具。爬虫可以满足自己的个性化搜索需求，大家赶快动手试试吧。（如果觉得文章还不错的话...
Python爬虫完整代码拿走不谢
2023-03-22 09:46

q56731523的博客对于新手做Python爬虫来说是有点难处的，前期练习的时候可以直接套用模板，这样省时省力还很方便。
【Python网络爬虫案例】python爬虫之模拟登录
2024-07-01 08:30

左手の明天的博客在进行数据采集时，有些网站需要进行登录才能获取到所需的数据。本文将介绍如何使用Python爬虫进行模拟登录，以便采集网站的数据。我们提供了完善的方案和代码示例，让你能够轻松操作并获取所需的数据。
【Python】去哪儿旅游景点数据爬虫
2022-02-15 20:53

浪荡子爱自由的博客爬虫网站：去哪儿-https://travel.qunar.com/place/ 1.爬取城市ID链接例如：https://travel.qunar.com/p-cs300148-haikou # -*- coding: utf-8 -*- from bs4 import BeautifulSoup import pandas as pd import...
python爬虫入门教程：爬取网页图片
2022-04-05 15:25

plexming的博客而用python做爬虫也十分简单方便，下面通过一个简单的小爬虫程序来看一看写爬虫的基本过程：准备工作语言：python IDE：pycharm 首先是要用到的库，因为是刚入门最简单的程序，我们主要就用到下面这两： ...
没有解决我的问题, 去提问

python爬虫去哪网热门景点

-*- coding: utf-8 -*-

created by:tianxing

created date:2017-11-1

1条回答 默认 最新

-- coding: utf-8 --

1条回答默认最新