weixin_71976922 2023-11-01 17:31 · Acceptance rate: 0%
Viewed 13 times

Scrapy spider runs, but no data is written to the Excel file

I want to crawl data across multiple pages (each listing page contains several links leading to detail pages I also need to scrape), and then export the data to an Excel file through an item pipeline. The program runs without reporting any errors, but the resulting Excel file contains no data.
doctor.py

import scrapy
from scrapy import Selector,Request
#from scrapy.http import Request
from scrapy.http import HtmlResponse
from scrapy_doctor.items import ScrapyDoctorItem
#from scrapy.linkextractors import LinkExtractor
#from scrapy.spiders import CrawlSpider,Rule


class DoctorSpider(scrapy.Spider):
    name = "doctor"
    allowed_domains = ["so.120ask.com"]
    #start_urls = ["http://so.120ask.com/?kw=%E6%8A%91%E9%83%81&page=1&isloc=1"]
    #url='http://so.120ask.com/?kw=%E6%8A%91%E9%83%81&page=%d&isloc=1'
    #page=1

    def start_requests(self):
        for page in range(2):
            yield Request(url=f'http://so.120ask.com/?kw=%E6%8A%91%E9%83%81&page={page + 1}&isloc=1')


    def parse(self, response: HtmlResponse):
        sel = Selector(response)
        list_items = sel.css('#datalist > li')
        for list_item in list_items:
            detail_url = list_item.css('h3 > a::attr(href)').extract_first()
            yield Request(url="http:" + detail_url, callback=self.parse_detail)



    def parse_detail(self,response):
        sel = Selector(response)
        doctor_item=kwargs ['item']
        doctor_item=ScrapyDoctorItem()
        doctor_item['doctorInformation']=sel.css('span[class="b_sp1"]::text').extract_first()
        doctor_item['goodAt'] = sel.css('span[class="b_sp2"]::text').extract_first()
        doctor_item['question'] = sel.css('h1[id="askH1"]::text').extract_first()
        doctor_item['answer'] = sel.css('div[class="crazy_new"]::text').extract_first()
        yield doctor_item

items.py

import scrapy


class ScrapyDoctorItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # doctor information
    doctorInformation = scrapy.Field()
    # areas of expertise
    goodAt = scrapy.Field()
    # question
    question = scrapy.Field()
    # answer
    answer = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import csv
import openpyxl

#from itemadapter import ItemAdapter


class ScrapyDoctorPipeline:
    def __init__(self):
        #self.fp = open('book.json', 'w', encoding='utf-8')
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        self.ws.title = 'Doctor'
        self.ws.append(('医生信息', '擅长', '问题', '回答'))

    def close_spider(self, spider):
        self.wb.save('百度健康抑郁问答.xlsx')
        #self.csv_file.close()

    def process_item(self, item, spider):
        #self.csv_writer.writerow(ItemAdapter(item).asdict())
        doctorInformation = item.get('doctorInformation', '')
        goodAt = item.get('goodAt', '')
        question = item.get('question', '')
        answer = item.get('answer', '')
        self.ws.append((doctorInformation, goodAt, question, answer))
        return item

settings.py

# Scrapy settings for scrapy_doctor project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_doctor"

SPIDER_MODULES = ["scrapy_doctor.spiders"]
NEWSPIDER_MODULE = "scrapy_doctor.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "scrapy_doctor.middlewares.ScrapyDoctorSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "scrapy_doctor.middlewares.ScrapyDoctorDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "scrapy_doctor.pipelines.ScrapyDoctorPipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
#REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
#TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
#FEED_EXPORT_ENCODING = "utf-8"
#ITEM_PIPELINES = {
#    # There can be multiple pipelines; each has a priority in the range 1-1000, and lower values run first.
#    'scrapy_doctor.pipelines.ScrapyDoctorPipeline': 300,
#}


2 answers

  • CSDN-Ada助手 (CSDN-AI official account) 2023-11-02 05:02

[The following answer was generated by GPT]

Problem analysis: Based on the code and the description, the spider runs but nothing is written to the Excel file. The likely causes are:

1. The scraped items are not actually reaching the pipeline that writes the Excel file (see the sketch right after this list).
2. The data is not being extracted correctly, so the XPath/CSS selectors may need adjusting.
3. The pipeline is not enabled and configured in the project settings.
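One concrete way point 1 can happen with the posted code: in doctor.py, parse_detail contains the line `doctor_item=kwargs ['item']`, but `kwargs` is not a parameter of that method, so every detail-page callback raises a NameError before the item is yielded. Scrapy logs this as a spider error while the crawl still finishes, which can look like a clean run if the log is not inspected. Below is a minimal sketch of parse_detail with that line removed, assuming the selectors from the posted code are otherwise correct and reusing the imports already present in doctor.py:

```python
    def parse_detail(self, response):
        # Build the item here instead of pulling it from an undefined `kwargs`.
        sel = Selector(response)
        doctor_item = ScrapyDoctorItem()
        doctor_item['doctorInformation'] = sel.css('span[class="b_sp1"]::text').extract_first()
        doctor_item['goodAt'] = sel.css('span[class="b_sp2"]::text').extract_first()
        doctor_item['question'] = sel.css('h1[id="askH1"]::text').extract_first()
        doctor_item['answer'] = sel.css('div[class="crazy_new"]::text').extract_first()
        yield doctor_item
```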

Suggested fix:

1. Make sure the pipeline is enabled in the project's settings.py by registering it under ITEM_PIPELINES:

```python
ITEM_PIPELINES = {
    'scrapy_doctor.pipelines.ExcelExportPipeline': 300,
}
```

2. In pipelines.py, create a pipeline class ExcelExportPipeline that writes the scraped data to an Excel file:

```python
import xlwt


class ExcelExportPipeline(object):
    def __init__(self):
        self.headers = ['doctorInformation', 'goodAt', 'question', 'answer']
        self.excel_file = xlwt.Workbook()
        self.sheet = self.excel_file.add_sheet('data')
        # write the header row
        for i, header in enumerate(self.headers):
            self.sheet.write(0, i, header)
        self.row = 1  # next empty row

    def process_item(self, item, spider):
        for i, field in enumerate(self.headers):
            self.sheet.write(self.row, i, item.get(field))
        self.row += 1
        return item

    def close_spider(self, spider):
        self.excel_file.save('data.xls')
```

In this pipeline class, `__init__` creates the Excel workbook and worksheet and writes the header row. `process_item` then writes each scraped item into the next free row, and `close_spider` saves the Excel file once the crawl finishes.

Run the modified code and check whether the Excel file now contains data. If the problem persists, please share more of the error output or details.
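As a quick cross-check that does not depend on any custom pipeline, Scrapy's built-in feed exports can also dump the scraped items to a file (the FEEDS setting is available in Scrapy 2.1+, and the overwrite option in 2.4+; the file name below is just an example). If this file also ends up empty after running `scrapy crawl doctor`, the items are never being yielded by the spider and the problem lies upstream of the pipeline:

```python
# settings.py — a minimal sketch using Scrapy's built-in feed export as a sanity check
FEEDS = {
    "doctor_check.csv": {       # example output path
        "format": "csv",
        "encoding": "utf-8",
        "overwrite": True,      # requires Scrapy >= 2.4
    },
}
```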



