CrawlSpider does not follow the links defined in its rules

Here is my code. response.url is always "http://book.douban.com/top250"; the spider never follows any further links. Any help solving this would be much appreciated.

books.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors import LinkExtractor
from douban.items import DoubanItem


class BooksSpider(CrawlSpider):
    name = "BooksSpider"
    allowed_domains = ["book.douban.com"]
    start_urls = [
        "http://book.douban.com/top250"
    ]

    rules = (
        Rule(LinkExtractor(allow=r'https://book.douban.com/top250\?start=\d+'),
             callback="parse"),
        Rule(LinkExtractor(allow=r'https://book.douban.com/subject/\d+'),
             callback="parse"),
    )

    def parse(self, response):
        sel = Selector(response=response)
        item = DoubanItem()

        item['name'] = sel.xpath("//h1")[0].extract().strip()

        try:
            contents = sel.xpath("//div[@id='link-report']/p//text()").extract()
            item['content_desc'] = "\n".join(content for content in contents)
        except:
            item['content_desc'] = " "
        try:
            profiles = sel.xpath("//div[@class='related_info']/div[@class='indent']")[1].xpath("//div[@class='intro']/p/text()").extract()
            item['author_profile'] = "\n".join(profile for profile in profiles)
        except:
            item['author_profile'] = " "

        datas = response.xpath("//div[@id='info']//text()").extract()
        datas = [data.strip() for data in datas]
        datas = [data for data in datas if data != '']
        for data in datas:
            if u"作者" in data:
                item["author"] = datas[datas.index(data)+1]
            elif u":" not in data:
                item["author"] = datas[datas.index(data)+2]
            elif u"出版社:" in data:
                item["press"] = datas[datas.index(data)+1]
            elif u"出版年:" in data:
                item["date"] = datas[datas.index(data)+1]
            elif u"页数:" in data:
                item["page"] = datas[datas.index(data)+1]
            elif u"定价:" in data:
                item["price"] = datas[datas.index(data)+1]
            elif u"ISBN:" in data:
                item["ISBN"] = datas[datas.index(data)+1]
        print item
        return item
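
One thing that can be checked without running the spider is whether the two allow patterns in rules would match the links at all. Both patterns are pinned to https://, while start_urls uses http://, and LinkExtractor applies its allow patterns as regex searches against the absolute URL of each link. The sketch below uses only the standard re module. The sample links are hypothetical (the real hrefs on the page are not shown in the question), and the scheme-agnostic patterns are just one possible alternative, not a confirmed fix.

```python
import re

# Patterns copied from the rules in the question.
PAGE_RE = re.compile(r'https://book.douban.com/top250\?start=\d+')
SUBJECT_RE = re.compile(r'https://book.douban.com/subject/\d+')

# Scheme-agnostic variants (an assumption, not part of the original code).
PAGE_ANY = re.compile(r'https?://book\.douban\.com/top250\?start=\d+')
SUBJECT_ANY = re.compile(r'https?://book\.douban\.com/subject/\d+')

# Hypothetical sample links; the real page may use either scheme.
SAMPLES = [
    "http://book.douban.com/top250?start=25",
    "https://book.douban.com/top250?start=25",
    "http://book.douban.com/subject/1770782/",
]


def matches_any(patterns, url):
    """Return True if at least one pattern is found somewhere in url."""
    return any(p.search(url) is not None for p in patterns)


if __name__ == "__main__":
    for url in SAMPLES:
        original = matches_any([PAGE_RE, SUBJECT_RE], url)
        relaxed = matches_any([PAGE_ANY, SUBJECT_ANY], url)
        print(url, "original:", original, "relaxed:", relaxed)
```

If the page's links turn out to use http://, the original patterns extract nothing and the spider has no links to follow, which would be consistent with the symptom described.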


3 answers

普通网友 2017-04-26 02:41
I suggest providing an HTTP packet capture, or the program's own log and stack trace.
