weixin_71976922 2023-11-01 17:31 · Acceptance rate: 0%
Viewed 13 times

Scrapy spider runs, but no data is written to the Excel file

I want to crawl data across multiple pages (each listing page contains several links leading to detail pages I also need to scrape), and then export the data to an Excel file through an item pipeline. The program runs without reporting any errors, but the resulting Excel file contains no data.
doctor.py

import scrapy
from scrapy import Selector,Request
#from scrapy.http import Request
from scrapy.http import HtmlResponse
from scrapy_doctor.items import ScrapyDoctorItem
#from scrapy.linkextractors import LinkExtractor
#from scrapy.spiders import CrawlSpider,Rule


class DoctorSpider(scrapy.Spider):
    name = "doctor"
    allowed_domains = ["so.120ask.com"]
    #start_urls = ["http://so.120ask.com/?kw=%E6%8A%91%E9%83%81&page=1&isloc=1"]
    #url='http://so.120ask.com/?kw=%E6%8A%91%E9%83%81&page=%d&isloc=1'
    #page=1

    def start_requests(self):
        for page in range(2):
            yield Request(url=f'http://so.120ask.com/?kw=%E6%8A%91%E9%83%81&page={page + 1}&isloc=1')


    def parse(self, response: HtmlResponse):
        sel = Selector(response)
        list_items = sel.css('#datalist > li')
        for list_item in list_items:
            detail_url = list_item.css('h3 > a::attr(href)').extract_first()
            yield Request(url="http:" + detail_url, callback=self.parse_detail)



    def parse_detail(self,response):
        sel = Selector(response)
        doctor_item=kwargs ['item']
        doctor_item=ScrapyDoctorItem()
        doctor_item['doctorInformation']=sel.css('span[class="b_sp1"]::text').extract_first()
        doctor_item['goodAt'] = sel.css('span[class="b_sp2"]::text').extract_first()
        doctor_item['question'] = sel.css('h1[id="askH1"]::text').extract_first()
        doctor_item['answer'] = sel.css('div[class="crazy_new"]::text').extract_first()
        yield doctor_item

items.py

import scrapy


class ScrapyDoctorItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # doctor information
    doctorInformation = scrapy.Field()
    # areas of expertise
    goodAt = scrapy.Field()
    # question
    question = scrapy.Field()
    # answer
    answer = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import csv
import openpyxl

#from itemadapter import ItemAdapter


class ScrapyDoctorPipeline:
    def __init__(self):
        #self.fp = open('book.json', 'w', encoding='utf-8')
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        self.ws.title = 'Doctor'
        self.ws.append(('医生信息', '擅长', '问题', '回答'))

    def close_spider(self, spider):
        self.wb.save('百度健康抑郁问答.xlsx')
        #self.csv_file.close()

    def process_item(self, item, spider):
        #self.csv_writer.writerow(ItemAdapter(item).asdict())
        doctorInformation = item.get('doctorInformation', '')
        goodAt = item.get('goodAt', '')
        question = item.get('question', '')
        answer = item.get('answer', '')
        self.ws.append((doctorInformation, goodAt, question, answer))
        return item

settings.py

# Scrapy settings for scrapy_doctor project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_doctor"

SPIDER_MODULES = ["scrapy_doctor.spiders"]
NEWSPIDER_MODULE = "scrapy_doctor.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "scrapy_doctor.middlewares.ScrapyDoctorSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "scrapy_doctor.middlewares.ScrapyDoctorDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "scrapy_doctor.pipelines.ScrapyDoctorPipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
#REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
#TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
#FEED_EXPORT_ENCODING = "utf-8"
#ITEM_PIPELINES = {
#    # There can be multiple pipelines; each has a priority in the range 1-1000, and lower values run first.
#    'scrapy_doctor.pipelines.ScrapyDoctorPipeline': 300,
#}


2 answers

  • CSDN-Ada助手 (CSDN-AI official account) 2023-11-02 05:02

[The following answer was generated by GPT]

Problem analysis: Based on the code and the description, the spider runs but nothing is written to the Excel file. The likely causes are:

1. The scraped items are not actually reaching the pipeline that writes the Excel file (see the sketch right after this list).
2. The data is not being extracted correctly, so the XPath/CSS selectors may need adjusting.
3. The pipeline is not enabled and configured in the project settings.
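One concrete way point 1 can happen with the posted code: in doctor.py, parse_detail contains the line `doctor_item=kwargs ['item']`, but `kwargs` is not a parameter of that method, so every detail-page callback raises a NameError before the item is yielded. Scrapy logs this as a spider error while the crawl still finishes, which can look like a clean run if the log is not inspected. Below is a minimal sketch of parse_detail with that line removed, assuming the selectors from the posted code are otherwise correct and reusing the imports already present in doctor.py:

```python
    def parse_detail(self, response):
        # Build the item here instead of pulling it from an undefined `kwargs`.
        sel = Selector(response)
        doctor_item = ScrapyDoctorItem()
        doctor_item['doctorInformation'] = sel.css('span[class="b_sp1"]::text').extract_first()
        doctor_item['goodAt'] = sel.css('span[class="b_sp2"]::text').extract_first()
        doctor_item['question'] = sel.css('h1[id="askH1"]::text').extract_first()
        doctor_item['answer'] = sel.css('div[class="crazy_new"]::text').extract_first()
        yield doctor_item
```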

Suggested fix:

1. Make sure the pipeline is enabled in the project's settings.py by registering it under ITEM_PIPELINES:

```python
ITEM_PIPELINES = {
    'scrapy_doctor.pipelines.ExcelExportPipeline': 300,
}
```

2. In pipelines.py, create a pipeline class ExcelExportPipeline that writes the scraped data to an Excel file:

```python
import xlwt


class ExcelExportPipeline(object):
    def __init__(self):
        self.headers = ['doctorInformation', 'goodAt', 'question', 'answer']
        self.excel_file = xlwt.Workbook()
        self.sheet = self.excel_file.add_sheet('data')
        # write the header row
        for i, header in enumerate(self.headers):
            self.sheet.write(0, i, header)
        self.row = 1  # next empty row

    def process_item(self, item, spider):
        for i, field in enumerate(self.headers):
            self.sheet.write(self.row, i, item.get(field))
        self.row += 1
        return item

    def close_spider(self, spider):
        self.excel_file.save('data.xls')
```

In this pipeline class, `__init__` creates the Excel workbook and worksheet and writes the header row. `process_item` then writes each scraped item into the next free row, and `close_spider` saves the Excel file once the crawl finishes.

Run the modified code and check whether the Excel file now contains data. If the problem persists, please share more of the error output or details.
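As a quick cross-check that does not depend on any custom pipeline, Scrapy's built-in feed exports can also dump the scraped items to a file (the FEEDS setting is available in Scrapy 2.1+, and the overwrite option in 2.4+; the file name below is just an example). If this file also ends up empty after running `scrapy crawl doctor`, the items are never being yielded by the spider and the problem lies upstream of the pipeline:

```python
# settings.py — a minimal sketch using Scrapy's built-in feed export as a sanity check
FEEDS = {
    "doctor_check.csv": {       # example output path
        "format": "csv",
        "encoding": "utf-8",
        "overwrite": True,      # requires Scrapy >= 2.4
    },
}
```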



