m0_58990004 2021-08-03 18:51

Cannot download images when crawling 站长素材 (sc.chinaz.com) with Scrapy

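I am working through a course example. I have checked the earlier code and found no problems, and the program runs normally. After the spider obtains the image URLs, the download requests fail. I have checked the storage path and the file name: the folder is created, but no images are saved. Please help. (I have confirmed that there is no cookie or anti-hotlinking protection, and the image URLs open normally in a browser.)
Here is the code:
Spider file: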

# -*- coding:utf-8 -*-
import scrapy
from imgsPro.items import ImgsproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="container"]/div')
        for div in div_list:
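            # The images are lazy-loaded: the attribute only becomes src after the
            # browser renders the page; in the raw HTML it is src2.
            # Note: read the placeholder attribute (not always src2; other sites may use another name).
            src2 = div.xpath('./div/a/img/@src2').extract_first()
            if src2 is None:
                continue  # no image in this div; 'http:' + None would raise TypeError
            src2 = 'http:' + src2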
            #print(src2)
            
            item = ImgsproItem()
            item['src2'] = src2

            yield item
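
The spider above prefixes the `src2` value with `http:` by hand. A minimal sketch of an alternative, assuming the attribute values are protocol-relative (for example `//scpic.chinaz.net/...`, as the printed URLs suggest): resolving them against the page URL with the standard library's `urljoin` inherits the page's scheme and tolerates a missing attribute. The helper name `absolute_image_url` is hypothetical and not part of the original project.

```python
from typing import Optional
from urllib.parse import urljoin

PAGE_URL = 'https://sc.chinaz.com/tupian/'  # start URL taken from the spider above


def absolute_image_url(page_url: str, src: Optional[str]) -> Optional[str]:
    """Resolve a possibly protocol-relative image URL against the page URL.

    Returns None when the attribute was missing, so the caller can skip the item.
    """
    if not src:
        return None
    return urljoin(page_url, src.strip())


if __name__ == '__main__':
    print(absolute_image_url(PAGE_URL, '//scpic.chinaz.net/Files/pic/pic9/202107/bpic23823_s.jpg'))
    print(absolute_image_url(PAGE_URL, None))
```

Inside `parse`, `response.urljoin(src2)` resolves the value against the response URL in the same way. Whether requesting the images over https avoids the failed downloads is not confirmed by this post.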


settings:

# Scrapy settings for imgsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgsPro'

SPIDER_MODULES = ['imgsPro.spiders']
NEWSPIDER_MODULE = 'imgsPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
LOG_LEVEL = 'ERROR'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imgsPro.middlewares.ImgsproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'imgsPro.middlewares.ImgsproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgsPro.pipelines.imgsPilepline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Directory where downloaded images are stored (created automatically if missing)
IMAGES_STORE = './imgs'


items:

import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    src2 = scrapy.Field()
    # pass


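pipelines:

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class imgsPilepline(ImagesPipeline):

    # Issue a download request for the image URL carried by the item
    def get_media_requests(self, item, info):
        print(item['src2'])
        #yield scrapy.Request(item['src2'])  # no callback needed for parsing
        yield scrapy.Request(url=item['src2'])

    # Return the path of the stored image, relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None, *, item=None):
        # The base directory is configured in settings:
        #     IMAGES_STORE = './imgs' (created automatically if missing)

        # Fixed name for testing, so every image overwrites the same file;
        # request.url.split('/')[-1] would give one file per image.
        imgName = 'test.jpg'
        return imgName  # only the file name needs to be returned

    def item_completed(self, results, item, info):
        print(results)  # for testing
        return item  # passed on to the next pipeline class (may be omitted if there is none)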
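
`item_completed` receives `results` as a list of `(success, value)` tuples, which is what the `print(results)` call above displays. A minimal sketch of separating the two cases so that failures stand out: `split_results` is a hypothetical helper, and the sample data is a stand-in that uses a plain string where Scrapy would pass a Twisted `Failure` object.

```python
from typing import Any, List, Tuple


def split_results(results: List[Tuple[bool, Any]]) -> Tuple[List[Any], List[Any]]:
    """Separate successful downloads from failures in an item_completed() results list."""
    downloaded = [value for ok, value in results if ok]
    failed = [value for ok, value in results if not ok]
    return downloaded, failed


if __name__ == '__main__':
    # Stand-in data: a dict describing the stored file on success,
    # a placeholder string instead of a Failure object on failure.
    sample = [
        (True, {'url': 'http://example.com/a.jpg', 'path': 'a.jpg'}),
        (False, 'download-error'),
    ]
    downloaded, failed = split_results(sample)
    print(len(downloaded), 'downloaded,', len(failed), 'failed')
```

Called from `item_completed`, a helper like this would make it easy to log or count the failed entries instead of printing the raw list.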

Result:

(pythonProject) C:\Users\13564\Desktop\pythonProject\imgsPro>scrapy crawl img
http://scpic2.chinaz.net/Files/pic/pic9/202107/bpic23825_s.jpg
http://scpic.chinaz.net/Files/pic/pic9/202107/bpic23823_s.jpg
http://scpic.chinaz.net/Files/pic/pic9/202107/bpic23824_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23826_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23828_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23827_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/apic34194_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/apic34190_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/apic34189_s.jpg
http://scpic.chinaz.net/Files/pic/pic9/202107/apic34191_s.jpg
http://scpic.chinaz.net/Files/pic/pic9/202107/apic34193_s.jpg
http://scpic.chinaz.net/Files/pic/pic9/202107/apic34192_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/hpic4260_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/hpic4257_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/hpic4259_s.jpg
http://scpic3.chinaz.net/Files/pic/pic9/202107/hpic4256_s.jpg
http://scpic1.chinaz.net/Files/pic/pic9/202107/hpic4255_s.jpg
http://scpic1.chinaz.net/Files/pic/pic9/202107/hpic4258_s.jpg
http://scpic1.chinaz.net/Files/pic/pic9/202107/apic34327_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/apic34251_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/apic34253_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/apic34250_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/apic34249_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/apic34252_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/apic34254_s.jpg
http://scpic1.chinaz.net/Files/pic/pic9/202107/bpic23818_s.jpg
http://scpic1.chinaz.net/Files/pic/pic9/202107/bpic23822_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/bpic23819_s.jpg
http://scpic2.chinaz.net/Files/pic/pic9/202107/bpic23817_s.jpg
http://scpic.chinaz.net/Files/pic/pic9/202107/bpic23821_s.jpg
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]
[(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)]


Please help.


2 answers

  • m0_58990004 2021-08-03 22:08

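    Found the cause: MEDIA_ALLOW_REDIRECTS = True has to be added to settings. It seems to be related to middleware, which I have not studied yet, so I am not sure what it means. Can anyone explain?
    Reading the full log shows that a problem was in fact reported; I pasted the message into Baidu and was told to add the line above. With LOG_LEVEL = 'ERROR' in settings, however, that message is not shown.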
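
For reference, the settings change described above could look like the fragment below. According to the Scrapy documentation, media pipelines ignore redirects by default, so a redirected image request is treated as a failed download unless `MEDIA_ALLOW_REDIRECTS` is enabled; it is a media pipeline setting rather than a middleware one. The `'WARNING'` threshold is an assumption about the level at which the failed download is logged; removing `LOG_LEVEL` altogether falls back to the default `'DEBUG'`, which shows everything.

```python
# settings.py (fragment)

# Let the media pipelines follow HTTP redirects instead of failing the download.
MEDIA_ALLOW_REDIRECTS = True

# Show warnings as well as errors, so failed media downloads appear in the log
# (assumed level; use 'DEBUG' to see everything).
LOG_LEVEL = 'WARNING'
```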

    Accepted by the asker as the best answer.
Question events

  • Closed by the system on August 11
  • Answer accepted on August 3
  • Question created on August 3
