小玉我是龙叔呀 2017-10-21 14:25

Scrapy reports "Missing scheme in request url" when running a spider

I'm a complete Scrapy beginner. I've been playing with example code from the web; the example is the hip-hop lyrics crawler from http://blog.csdn.net/czl389/article/details/77278166. It has three spiders: songurls, lyrics and songinfo. With the songurls spider I can scrape the URLs from Xiami Music and save them to SongUrls.csv, but the lyrics spider fails with the error below:
D:\xiami2\xiami2>scrapy crawl lyrics -o Lyrics.csv
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiami2)
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiami2.spiders', 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 4.0; Trident/3.0)', 'FEED_URI': 'Lyrics.csv', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.2, 'SPIDER_MODULES': ['xiami2.spiders'], 'BOT_NAME': 'xiami2'}
2017-10-21 21:13:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled item pipelines:
['xiami2.pipelines.Xiami2Pipeline']
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider opened
2017-10-21 21:13:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-21 21:13:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-21 21:13:31 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "d:\python3.5\lib\site-packages\scrapy\core\engine.py", line 127, in next_request
request = next(slot.start_requests)
File "d:\python3.5\lib\site-packages\scrapy\spiders\
_init__.py", line 83, in start_requests
yield Request(url, dont_filter=True)
File "d:\python3.5\lib\site-packages\scrapy\http\request__init__.py", line 25, in init
self._set_url(url)
File "d:\python3.5\lib\site-packages\scrapy\http\request__init__.py", line 58, in set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-21 21:13:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 567323),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 536236)}
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider closed (finished)
------------------------------ divider --------------------------------------

I went and looked at request/__init__.py and found this check:

if ':' not in self._url:
    raise ValueError('Missing scheme in request url: %s' % self._url)

I've read through a number of solutions online, but none of them fixed my problem, so I'm asking here hoping someone can point me in the right direction (I really am out of C-coins). I haven't asked many questions before, so please bear with any formatting flaws.
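For what it's worth, that check fires for any URL without a ':' in it, including an empty string, and the blank after the colon on my error line suggests the URL really is empty. A minimal repro (my own test snippet, assuming Scrapy 1.4's Request behaves as the source above suggests):

from scrapy.http import Request

# an empty string contains no ':', so this raises the same error
try:
    Request('')
except ValueError as e:
    print(e)  # Missing scheme in request url: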
Attaching the code below.

#songurls.py
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from ..items import SongUrlItem

class SongurlsSpider(scrapy.Spider):
    name = 'songurls'
    allowed_domains = ['xiami.com']

    # put the links page/1 through page/401 into start_urls
    start_url_list = []
    url_fixed = 'http://www.xiami.com/song/tag/Hip-Hop/page/'
    # widen the range to 1-401 to cover every page
    for i in range(1, 402):
        start_url_list.append(url_fixed + str(i))
    start_urls = start_url_list

    def parse(self, response):
        urls = response.xpath('//*[@id="wrapper"]/div[2]/div/div/div[2]/table/tbody/tr/td[2]/a[1]/@href').extract()
        for url in urls:
            song_url = response.urljoin(url)
            url_item = SongUrlItem()
            url_item['song_url'] = song_url
            yield url_item
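(Side note: response.urljoin is what makes the scraped hrefs absolute; it resolves them against the page URL much like urllib's urljoin. The href below is made up for illustration:)

from urllib.parse import urljoin

# a relative href resolved against the page URL becomes absolute
print(urljoin('http://www.xiami.com/song/tag/Hip-Hop/page/1', '/song/1234'))
# -> http://www.xiami.com/song/1234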

------------------------------ divider --------------------------------------
#lyrics.py
import scrapy
import re
from ..items import LyricItem  # this import was missing; LyricItem is used in parse()

class LyricsSpider(scrapy.Spider):
    name = 'lyrics'
    allowed_domains = ['xiami.com']
    song_url_file = 'SongUrls.csv'

    def __init__(self, *args, **kwargs):
        # read every song URL from SongUrls.csv
        f = open(self.song_url_file, "r")
        lines = f.readlines()
        # line[:-1] strips the newline at the end of each line
        # lines[1:] skips the first row, which holds the CSV field name
        song_url_list = [line[:-1] for line in lines[1:]]
        f.close()
        while '\n' in song_url_list:
            song_url_list.remove('\n')

        self.start_urls = song_url_list  # [:100]  # remove the [:100] slice to crawl everything

    def parse(self, response):
        lyric_lines = response.xpath('//*[@id="lrc"]/div[1]/text()').extract()
        lyric = ''
        for lyric_line in lyric_lines:
            lyric += lyric_line
        # print(lyric)

        lyricItem = LyricItem()
        lyricItem['lyric'] = lyric
        lyricItem['song_url'] = response.url
        yield lyricItem
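One suspicion I have: the traceback shows an empty URL, and a blank line in the CSV turns into '' after line[:-1], which the while loop above never removes (it only looks for '\n'). If that is the cause, reading the file more defensively should help; a sketch (the helper name load_song_urls is mine, not from the tutorial):

def load_song_urls(path='SongUrls.csv'):
    # read song URLs from the CSV, skipping the header row and blank rows
    urls = []
    with open(path, 'r') as f:
        for line in f.readlines()[1:]:  # the first row holds the field name
            url = line.strip()          # drops '\n' and any stray '\r'
            if url:                     # skip blank rows entirely
                urls.append(url)
    return urls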

songinfo isn't used yet, so it doesn't matter here.
------------------------------ divider --------------------------------------
#items.py
import scrapy

class SongUrlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    song_url = scrapy.Field()  # song link

class LyricItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    lyric = scrapy.Field()     # lyrics
    song_url = scrapy.Field()  # song link

class SongInfoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    song_url = scrapy.Field()    # song link
    song_title = scrapy.Field()  # song title
    album = scrapy.Field()       # album
    # singer = scrapy.Field()   # singer
    language = scrapy.Field()    # language
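pipelines.py isn't shown above; for reference, the Xiami2Pipeline named in the log would minimally be a pass-through like this (a sketch; the real file may differ):

# pipelines.py — minimal pass-through sketch (the real file may differ)
class Xiami2Pipeline(object):
    def process_item(self, item, spider):
        # hand each item to the feed exporter unchanged
        return item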

------------------------------ divider --------------------------------------
I added a few lines in middlewares.py:

# additions in middlewares.py (these lines sit inside my downloader middleware class;
# they need: from selenium import webdriver, from scrapy.http import HtmlResponse, import time)

sleep_seconds = 0.2  # sleep after the simulated load so the browser can fetch the response
default_sleep_seconds = 1  # sleep time for requests with no action

def process_request(self, request, spider):
    spider.logger.info('--------Spider request processed: %s' % spider.name)
    page = None

    driver = webdriver.PhantomJS()
    spider.logger.info('--------request.url: %s' % request.url)
    driver.get(request.url)
    driver.implicitly_wait(0.2)
    # wait briefly for the page to load before grabbing its source
    time.sleep(self.sleep_seconds)
    page = driver.page_source
    driver.close()

    return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
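One thing that nags at me: a custom middleware only runs if it is registered under DOWNLOADER_MIDDLEWARES in settings.py, and mine does not appear in the "Enabled downloader middlewares" list in the log above. Registration would look something like this (the class name here is a placeholder, not my exact one):

# settings.py — registering the custom downloader middleware (class name is a placeholder)
DOWNLOADER_MIDDLEWARES = {
    'xiami2.middlewares.PhantomJSMiddleware': 543,
}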

------------------------------ divider --------------------------------------
In settings.py I added a few lines and changed a few:

from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

DOWNLOAD_DELAY = 0.2

DEFAULT_REQUEST_HEADERS = {
    'Host': 'www.xiami.com',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
}

ITEM_PIPELINES = {
    'xiami2.pipelines.Xiami2Pipeline': 300,
}
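For debugging, a quick way to see what actually ends up in start_urls is to inspect the raw CSV lines (run from the project directory, where SongUrls.csv lives):

# sanity check of the CSV contents — look for the header row, blank lines, stray '\r'
with open('SongUrls.csv', 'r') as f:
    lines = f.readlines()
print(repr(lines[:5]))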

1 answer

小玉我是龙叔呀 2017-10-22 08:51

Don't let this sink... bumping my own thread!

