majunyu987 2020-02-27 13:45 Acceptance rate: 0%

Scrapy spider cannot extract image URLs

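The spider cannot extract the download URLs of the images on the product pages, although the product ID and name are extracted fine.
I am not sure whether the regular expression is the problem.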

Site being crawled: https://www.ssense.com/en-cn/women?q=top

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.http import Request
from ssense.items import SsenseItem


class SsensePicSpider(scrapy.Spider):
    name = 'ssense_pic'
    allowed_domains = ['ssense.com']
    start_urls = ['http://ssense.com/']

    def parse(self, response):  # entry-point callback
        search_word = 'top'  # search term; change as needed
        for i in range(1, 2):  # result pages to crawl (currently page 1 only)
            url = 'http://www.ssense.com/en-cn/women?q=' + str(search_word) + '&page=' + str(i)
            yield Request(url=url, callback=self.page)

    # Collect the product URLs from a search-result page
    def page(self, response):
        body = response.body.decode('utf-8', 'ignore')
        # Raw string so that \s reaches the regex engine unchanged
        url_id = r'"url":\s"([/a-z0-9-]*)"'
        item_id = re.compile(url_id).findall(body)  # relative product URLs
        for this_id in item_id:
            website = 'https://www.ssense.com/en-cn' + str(this_id)  # full product link
            yield Request(url=website, callback=self.next)

    def next(self, response):
        item = SsenseItem()
        body = response.body.decode('utf-8', 'ignore')

        # Product ID
        pro_id = r'"productID":\s(\d{7})'
        productID = re.compile(pro_id).findall(body)
        item['productID'] = productID

        # Product name
        item_name = r'"name":\s"([a-zA-Z -]*)"[,]'
        name = re.compile(item_name).findall(body)
        item['name'] = name

        # Product price
        item_price = r'"price":\s([0-9]*)'
        price = re.compile(item_price).findall(body)
        item['price'] = price

        # SKU
        item_sku = r'"sku":\s"([0-9A-Z]*)",'
        sku = re.compile(item_sku).findall(body)
        item['sku'] = sku

        # Image URL. The original class [a-z:/.0-9A-F_-] rejects commas and
        # most uppercase letters, so capture everything up to the closing quote.
        # If the page has no "image" key at all, this still returns an empty list.
        item_image = r'"image":\s"([^"]*)"'
        image = re.compile(item_image).findall(body)
        item['image'] = image

        yield item


1 answer

  • threenewbee 2020-02-27 17:39

    The image should be https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_1086,w_724/c_scale,h_480/f_auto,dpr_1.0/201071F110010_1.jpg

    That is the value that follows data-srcset=" in the markup. I cannot tell what your "image": pattern is supposed to match.

    <picture data-v-60b7d3e3=""><source data-v-60b7d3e3="" data-srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_1086,w_724/c_scale,h_480/f_auto,dpr_1.0/201071F110010_1.jpg" media="(min-width: 1025px)" srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_1086,w_724/c_scale,h_480/f_auto,dpr_1.0/201071F110010_1.jpg"><source data-v-60b7d3e3="" data-srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_320/f_auto,dpr_1.0/201071F110010_1.jpg" media="(min-width: 768px)" srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_320/f_auto,dpr_1.0/201071F110010_1.jpg"><img data-v-60b7d3e3="" data-srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_280/f_auto,dpr_1.0/201071F110010_1.jpg" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAXUAAAIwAQMAAABDTmnJAAAAA1BMVEUAAACnej3aAAAAAXRSTlMAQObYZgAAADFJREFUeNrtwTEBAAAAwiD7pzbDfmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANEBaQAAAZUbkzMAAAAASUVORK5CYII=" alt="Live the Process - Grey Seamless Sport Top" class="product-thumbnail lazyloaded" srcset="https://cldny.ccindex.cn/ssenseweb/image/upload/b_white,c_lpad,g_south,h_706,w_470/c_scale,h_280/f_auto,dpr_1.0/201071F110010_1.jpg"></picture>
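    The point above can be checked with a small standalone sketch. It runs the "image" pattern from the question against the URL quoted here, then pulls the URLs out of data-srcset instead. The helper name extract_image_urls and the sample strings are made up for illustration. The sketch also assumes that the HTML Scrapy downloads contains the same data-srcset attributes as the browser-rendered markup above; if the page fills them in with JavaScript, the function will simply return an empty list.

```python
import re

# Pattern copied from the question's spider.
ORIGINAL_IMAGE_PATTERN = re.compile(r'"image":\s"([a-z:/.0-9A-F_-]*)"')

# Assumed replacement: capture whatever sits inside data-srcset="...".
DATA_SRCSET_PATTERN = re.compile(r'data-srcset="([^"]+)"')


def extract_image_urls(html):
    """Return the unique data-srcset values in order of first appearance."""
    seen = []
    for url in DATA_SRCSET_PATTERN.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


# Sample data built from the URL quoted in this answer.
sample_url = ("https://cldny.ccindex.cn/ssenseweb/image/upload/"
              "b_white,c_lpad,g_south,h_706,w_470/c_scale,h_280/f_auto,dpr_1.0/"
              "201071F110010_1.jpg")
sample_html = ('<img data-srcset="%s" class="product-thumbnail" srcset="%s">'
               % (sample_url, sample_url))

# Even if the URL appeared as "image": "...", the original character class
# has no comma in it, so the pattern cannot reach the closing quote.
as_json = '"image": "%s"' % sample_url
print(ORIGINAL_IMAGE_PATTERN.findall(as_json))

# Reading data-srcset instead yields the URL.
print(extract_image_urls(sample_html))
```

    In the spider, something like item['image'] = extract_image_urls(body) inside next() would be one way to wire this in; whether it returns anything depends on what the server actually sends to Scrapy.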
    