DEFAULT_REQUEST_HEADERS is already configured in Scrapy's settings.py. How should headers be written when issuing a request?

2 answers (both were posted as images; no text is available)
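For reference, DEFAULT_REQUEST_HEADERS is applied by DefaultHeadersMiddleware to every outgoing request, and any headers dict passed to scrapy.Request is merged on top of it, so a per-request value wins when the same key appears in both. A minimal sketch of that pattern (the spider name and URLs below are placeholders, not from the original question):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # DEFAULT_REQUEST_HEADERS from settings.py is added automatically;
        # headers given here are merged in and override same-named defaults.
        yield scrapy.Request(
            url="https://example.com/detail",
            headers={
                "Referer": response.url,
                "X-Requested-With": "XMLHttpRequest",
            },
            callback=self.parse_detail,
        )

    def parse_detail(self, response):
        yield {"url": response.url, "title": response.xpath("//title/text()").extract_first()}
```

If no headers argument is given, the request simply goes out with the defaults from settings.py.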

Other related questions
Scrapy reports "Missing scheme in request url" when running a spider

I'm new to Scrapy and was playing with sample code from the web: the hip-hop lyrics crawler from http://blog.csdn.net/czl389/article/details/77278166. The example contains three spiders: songurls, lyrics and songinfo. The songurls spider scrapes song URLs from Xiami Music and saves them to SongUrls.csv without problems, but the lyrics spider fails with the error below.

```
D:\xiami2\xiami2>scrapy crawl lyrics -o Lyrics.csv
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiami2)
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiami2.spiders', 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 4.0; Trident/3.0)', 'FEED_URI': 'Lyrics.csv', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.2, 'SPIDER_MODULES': ['xiami2.spiders'], 'BOT_NAME': 'xiami2'}
2017-10-21 21:13:29 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled item pipelines: ['xiami2.pipelines.Xiami2Pipeline']
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider opened
2017-10-21 21:13:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-21 21:13:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-21 21:13:31 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "d:\python3.5\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "d:\python3.5\lib\site-packages\scrapy\spiders\__init__.py", line 83, in start_requests
    yield Request(url, dont_filter=True)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-21 21:13:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 567323),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 536236)}
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider closed (finished)
```

I looked into Request's __init__.py and found this check:

```python
if ':' not in self._url:
    raise ValueError('Missing scheme in request url: %s' % self._url)
```

I have looked at several solutions online, but none of them fixed my problem, so I'm asking here and hoping for some pointers (I really have no C-coins left). I don't ask questions often, so please forgive any formatting flaws. The code is attached below.

```python
# songurls.py
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from ..items import SongUrlItem

class SongurlsSpider(scrapy.Spider):
    name = 'songurls'
    allowed_domains = ['xiami.com']

    # put the page/1 ... page/401 links into start_urls
    start_url_list = []
    url_fixed = 'http://www.xiami.com/song/tag/Hip-Hop/page/'
    # range 1-401 covers every page
    for i in range(1, 402):
        start_url_list.extend([url_fixed + str(i)])
    start_urls = start_url_list

    def parse(self, response):
        urls = response.xpath('//*[@id="wrapper"]/div[2]/div/div/div[2]/table/tbody/tr/td[2]/a[1]/@href').extract()
        for url in urls:
            song_url = response.urljoin(url)
            url_item = SongUrlItem()
            url_item['song_url'] = song_url
            yield url_item
```

```python
# lyrics.py
import scrapy
import re

class LyricsSpider(scrapy.Spider):
    name = 'lyrics'
    allowed_domains = ['xiami.com']
    song_url_file = 'SongUrls.csv'

    def __init__(self, *args, **kwargs):
        # read all song URLs from SongUrls.csv
        f = open(self.song_url_file, "r")
        lines = f.readlines()
        # line[:-1] strips the trailing newline of each line
        # lines[1:] skips the CSV header row
        song_url_list = [line[:-1] for line in lines[1:]]
        f.close()
        while '\n' in song_url_list:
            song_url_list.remove('\n')
        self.start_urls = song_url_list  # [:100]  # drop [:100] to crawl everything

    def parse(self, response):
        lyric_lines = response.xpath('//*[@id="lrc"]/div[1]/text()').extract()
        lyric = ''
        for lyric_line in lyric_lines:
            lyric += lyric_line
        # print lyric
        lyricItem = LyricItem()
        lyricItem['lyric'] = lyric
        lyricItem['song_url'] = response.url
        yield lyricItem
```

songinfo is not used yet, so it does not matter here.

```python
# items.py
import scrapy

class SongUrlItem(scrapy.Item):
    song_url = scrapy.Field()    # song link

class LyricItem(scrapy.Item):
    lyric = scrapy.Field()       # lyrics
    song_url = scrapy.Field()    # song link

class SongInfoItem(scrapy.Item):
    song_url = scrapy.Field()    # song link
    song_title = scrapy.Field()  # title
    album = scrapy.Field()       # album
    # singer = scrapy.Field()    # singer
    language = scrapy.Field()    # language
```

A few lines were added to the middleware:

```python
sleep_seconds = 0.2        # sleep after the simulated click so the browser can fetch the response
default_sleep_seconds = 1  # sleep time for requests with no action

def process_request(self, request, spider):
    spider.logger.info('--------Spider request processed: %s' % spider.name)
    page = None
    driver = webdriver.PhantomJS()
    spider.logger.info('--------request.url: %s' % request.url)
    driver.get(request.url)
    driver.implicitly_wait(0.2)
    # sleep a few seconds so the page finishes loading before returning its content
    time.sleep(self.sleep_seconds)
    page = driver.page_source
    driver.close()
    return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
```

A few lines were also added or changed in settings:

```python
from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

DOWNLOAD_DELAY = 0.2

DEFAULT_REQUEST_HEADERS = {
    'Host': 'www.xiami.com',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
}

ITEM_PIPELINES = {
    'xiami2.pipelines.Xiami2Pipeline': 300,
}
```
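The traceback shows an empty URL after "Missing scheme in request url:", which usually means one of the lines read from SongUrls.csv is blank or not an absolute http(s) URL. A hedged sketch of a more defensive way to build start_urls, assuming the CSV column is named song_url (matching the exported item field):

```python
import csv

def load_song_urls(path="SongUrls.csv"):
    """Read song URLs from the exported CSV, skipping blank or scheme-less rows."""
    urls = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get("song_url") or "").strip()
            # keep only absolute URLs so Request() never sees a missing scheme
            if url.startswith(("http://", "https://")):
                urls.append(url)
    return urls
```

In the spider, `self.start_urls = load_song_urls(self.song_url_file)` would then replace the manual readlines() handling.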

How should the meta argument of Scrapy's FormRequest be set?

I'm crawling with Scrapy, and the parse function hands work off to a next-level callback. The code looks like this:

```python
item = SoccerDataItem()
for i in range(1, 8):
    item['player' + str(i + 1)] = players[i]
for j in range(1, 8):
    home_sub_list = response.xpath('//div[@class="left"]//li[@class="pl10"]')
    if home_sub_list[j - 1].xpath('./span/img[contains(@src,"subs_up")]'):
        item['player' + str(j)]['name'] = home_sub_list[j - 1].xpath('./div[@class="ml10"]').xpath('string(.)').re_first('\d{1,2}\xa0\xa0(.*)')
        item['player' + str(j)]['team_stand'] = 1
        item['player' + str(j)]['is_startup'] = 0
        item['player' + str(j)]['is_subs_up'] = 1
        item['player' + str(j)]['subs_up_time'] = home_sub_list[j].xpath('./span/img[contains(@src,"subs_up")]/following-sibling::span').xpath('string(.)').extract_first(default='')
        yield scrapy.FormRequest(url=data_site, formdata=formdata, meta={'player': item['player' + str(j)]}, callback=self.parse_data)
    else:
        item['player' + str(j)]['name'] = home_sub_list[j - 1].xpath('./div[@class="ml10"]').xpath('string(.)').re_first('\d{1,2}\xa0\xa0(.*)')
        item['player' + str(j)]['team_stand'] = 1
        item['player' + str(j)]['is_startup'] = 0
        item['player' + str(j)]['is_subs_up'] = 0
```

But it keeps raising this error at runtime:

```
    callback=self.parse_data)
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 66, in _urlencode
    for k, vs in seq
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 67, in <listcomp>
    for v in (vs if is_listlike(vs) else [vs])]
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\utils\python.py", line 119, in to_bytes
    'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got int
```

From what I found on Baidu, the values of the key-value pairs in meta are supposed to be strings, bytes and the like, which would explain the error when I pass in a dict. So how should I change this part?

PS: I'm writing in Python; apologies if the formatting is hard on the eyes!
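For what it's worth, the traceback points at _urlencode() inside FormRequest, i.e. at formdata rather than meta: formdata values get URL-encoded and therefore must be strings or bytes, while meta can carry arbitrary Python objects such as a dict. A hedged sketch of the call site (the match_id and page keys are invented for illustration, since the real formdata is not shown in the question):

```python
# formdata values are URL-encoded by FormRequest, so convert numbers to strings first.
formdata = {"match_id": str(match_id), "page": "1"}  # hypothetical form fields

yield scrapy.FormRequest(
    url=data_site,
    formdata=formdata,                         # str values only
    meta={"player": item["player" + str(j)]},  # meta may hold any object, a dict included
    callback=self.parse_data,
)
```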

Why does my Scrapy spider get nothing when crawling the Google Play store?

I want to crawl the Google Play store with Scrapy. The code raises no errors, yet it scrapes nothing. Why is that?

```python
# -*- coding: utf-8 -*-
import scrapy
# from scrapy.spiders import CrawlSpider, Rule
# from scrapy.linkextractors import LinkExtractor
from gp.items import GpItem
# from html.parser import HTMLParser as SGMLParser
import requests


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['https://play.google.com/']
    start_urls = ['https://play.google.com/store/apps/']

    '''
    rules = [
        Rule(LinkExtractor(allow=("https://play\.google\.com/store/apps/details",)), callback='parse_app', follow=True),
    ]
    '''

    def parse(self, response):
        selector = scrapy.Selector(response)
        urls = selector.xpath('//a[@class="LkLjZd ScJHi U8Ww7d xjAeve nMZKrb id-track-click"]/@href').extract()
        link_flag = 0
        links = []
        for link in urls:
            links.append(link)
        for each in urls:
            yield scrapy.Request(links[link_flag], callback=self.parse_next, dont_filter=True)
            link_flag += 1

    def parse_next(self, response):
        selector = scrapy.Selector(response)
        app_urls = selector.xpath('//div[@class="details"]/a[@class="title"]/@href').extract()
        print(app_urls)
        urls = []
        for url in app_urls:
            url = "http://play.google.com" + url
            print(url)
            urls.append(url)
        link_flag = 0
        for each in app_urls:
            yield scrapy.Request(urls[link_flag], callback=self.parse_app, dont_filter=True)
            link_flag += 1

    def parse_app(self, response):
        item = GpItem()
        item['app_url'] = response.url
        item['app_name'] = response.xpath('//div[@itemprop="name"]').xpath('text()').extract()
        item['app_icon'] = response.xpath('//img[@itempro="image"]/@src')
        item['app_developer'] = response.xpath('//')
        print(response.text)
        yield item
```

The terminal output is:

```
BettyMacbookPro-764:gp zhanjinyang$ scrapy crawl google
2019-11-12 08:46:45 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: gp)
2019-11-12 08:46:45 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.1 (default, Dec 14 2018, 13:28:58) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.4.2, Platform Darwin-18.5.0-x86_64-i386-64bit
2019-11-12 08:46:45 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'gp', 'NEWSPIDER_MODULE': 'gp.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['gp.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
2019-11-12 08:46:45 [scrapy.extensions.telnet] INFO: Telnet Password: b2d7dedf1f4a91eb
2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled item pipelines: ['gp.pipelines.GpPipeline']
2019-11-12 08:46:45 [scrapy.core.engine] INFO: Spider opened
2019-11-12 08:46:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-12 08:46:45 [py.warnings] WARNING: /anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://play.google.com/ in allowed_domains. warnings.warn(message, URLWarning)
2019-11-12 08:46:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-12 08:46:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/robots.txt> (referer: None)
2019-11-12 08:46:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/store/apps/> (referer: None)
2019-11-12 08:46:46 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-12 08:46:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 810,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 232419,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 11, 12, 8, 46, 46, 474543),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'memusage/max': 58175488,
 'memusage/startup': 58175488,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 11, 12, 8, 46, 45, 562775)}
2019-11-12 08:46:46 [scrapy.core.engine] INFO: Spider closed (finished)
```

Please help!!!
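Two things the log itself hints at: allowed_domains should hold bare domain names, not URLs (see the URLWarning), and Google Play pages are largely rendered by JavaScript, so XPaths keyed on generated CSS class names often match nothing in the raw HTML. A hedged sketch of those adjustments, purely illustrative rather than a verified working Google Play spider:

```python
import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['play.google.com']                 # a bare domain, not a URL
    start_urls = ['https://play.google.com/store/apps/']

    def parse(self, response):
        # Follow app-detail links by their stable URL path instead of generated class names.
        for href in response.xpath('//a[contains(@href, "/store/apps/details")]/@href').extract():
            yield response.follow(href, callback=self.parse_app, dont_filter=True)

    def parse_app(self, response):
        # Minimal placeholder item: record the detail-page URL and its <title>.
        yield {'app_url': response.url, 'title': response.xpath('//title/text()').extract_first()}
```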

How do I stop Scrapy from capitalizing header names in a Python 3 Request?

When sending a request with scrapy.Request under Python 3, Scrapy automatically normalizes the header names: the first letter is upper-cased, and so is the first letter after an underscore or other special character. The problem is that I need to send a header whose name is deliberately lower-case. After Scrapy processes the request the name comes out capitalized, and the server does not recognize the parameter at all. Does anyone know whether scrapy.Request has a way to leave headers untouched? When I send the same request with the requests library instead of scrapy.Request, the headers are not changed. ![headers before the request](https://img-ask.csdn.net/upload/201905/15/1557909540_468021.png) This is the header before the request. ![captured request headers](https://img-ask.csdn.net/upload/201905/15/1557909657_878941.png) This is the request header captured with a packet sniffer.
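A small sketch that just reproduces the behaviour being described: the headers passed to a Request are stored in Scrapy's Headers class, which title-cases the key names (the exact printed form may vary by Scrapy version, and the header name below is hypothetical):

```python
from scrapy import Request

req = Request("https://httpbin.org/headers", headers={"x-custom_token": "abc"})  # hypothetical lower-case name
print(list(req.headers.keys()))  # e.g. [b'X-Custom_Token']: the name has been title-cased
```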

Python scraping script will not run

```python
# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""
import requests as req
import pandas as pd
import matplotlib.pyplot as plt
import time
import re

url = 'http://wenshu.court.gov.cn/List/ListContent'
Index = 1
SleepNum = 3
dates = []
titles = []
nums = []
while (Index < 123):
    my_headers = {'User-Agent': 'User-Agent:Mozilla/5.0(Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.95Safari/537.36 Core/1.50.1280.400',}
    data = {'Param': '全文检索:执行', 'Index': Index, 'Page': '20', 'Order': '裁判日期', 'Direction': 'asc'}
    r = req.post(url, headers=my_headers, data=data)
    raw = r.json()
    pattern1 = re.compile('"裁判日期":"(.*?)"', re.S)
    date = re.findall(pattern1, raw)
    pattern2 = re.compile('"案号":"(.*?)"', re.S)
    num = re.findall(pattern2, raw)
    pattern3 = re.compile('"案件名称":"(.*?)"', re.S)
    title = re.findall(pattern3, raw)
    dates += date
    titles += title
    nums += num
    time.sleep(SleepNum)
    Index += 1

df = pd.DataFrame({'时间': dates, '案号': nums, '案件名称': titles})
df.to_excel('E:\result.xlsx')
```

Console output:

```
Python 2.7.11 |Anaconda 4.1.0 (64-bit)| (default, Jun 15 2016, 15:21:11) [MSC v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

runfile('C:/Users/xx/.spyder2/temp.py', wdir='C:/Users/xx/.spyder2')
Traceback (most recent call last):
  File "", line 1, in runfile('C:/Users/xx/.spyder2/temp.py', wdir='C:/Users/xx/.spyder2')
  File "D:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
    execfile(filename, namespace)
  File "D:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/xx/.spyder2/temp.py", line 50, in
    df.to_excel('E:\result.xlsx')
  File "D:\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1427, in to_excel
    excel_writer.save()
  File "D:\Anaconda2\lib\site-packages\pandas\io\excel.py", line 1444, in save
    return self.book.close()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\workbook.py", line 297, in close
    self._store_workbook()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\workbook.py", line 605, in _store_workbook
    xml_files = packager._create_package()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\packager.py", line 139, in _create_package
    self._write_shared_strings_file()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\packager.py", line 286, in _write_shared_strings_file
    sst._assemble_xml_file()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 53, in _assemble_xml_file
    self._write_sst_strings()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 83, in _write_sst_strings
    self._write_si(string)
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 110, in _write_si
    self._xml_si_element(string, attributes)
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\xmlwriter.py", line 122, in _xml_si_element
    self.fh.write("""%s""" % (attr, string))
  File "D:\Anaconda2\lib\codecs.py", line 706, in write
    return self.writer.write(data)
  File "D:\Anaconda2\lib\codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 7: ordinal not in range(128)

runfile('C:/Users/xx/.spyder2/temp.py', wdir='C:/Users/xx/.spyder2')
ERROR: execution aborted
```
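The console shows this is Python 2.7, and the UnicodeDecodeError is raised while xlsxwriter writes its shared-strings table, which points at byte strings containing Chinese text (at minimum the DataFrame column labels). A hedged sketch of one way around it on Python 2: use unicode literals for the labels, decode the extracted values if they are byte strings, and escape the backslash in the output path.

```python
# -*- coding: utf-8 -*-
# Python 2 only: give xlsxwriter unicode instead of raw UTF-8 byte strings.
dates_u = [d.decode('utf-8') if isinstance(d, str) else d for d in dates]
nums_u = [n.decode('utf-8') if isinstance(n, str) else n for n in nums]
titles_u = [t.decode('utf-8') if isinstance(t, str) else t for t in titles]

df = pd.DataFrame({u'时间': dates_u, u'案号': nums_u, u'案件名称': titles_u})
df.to_excel(u'E:\\result.xlsx')  # '\\' keeps '\r' from being read as a carriage return
```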

A Baidu Muzhi Doctor crawler: I want to first collect all the links for one question, but nothing gets crawled. Could someone please take a look and tell me why?

#A few words up front
In this crawler I want to collect every link about "咳嗽" (cough) on Baidu Muzhi Doctor; the next step would be to crawl the item contents inside each of those links, but I'm stuck on the very first step, so please take a look. I posted this question once before, but somehow I can't find it any more (it seems to have been deleted...?). Help a newbie out, thanks a lot!

This is the structure of my crawler:
![project structure](https://img-ask.csdn.net/upload/201911/27/1574787999_274479.png)

##ks:
```python
# -*- coding: utf-8 -*-
import scrapy
from kesou.items import KesouItem
from scrapy.selector import Selector
from scrapy.spiders import Spider
from scrapy.http import Request, FormRequest
import pymongo

class KsSpider(scrapy.Spider):
    name = 'ks'
    allowed_domains = ['kesou,baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M']

    def parse(self, response):
        item = KesouItem()
        contents = response.xpath('.//h3[@class="t"]')
        for content in contents:
            url = content.xpath('.//a/@href').extract()[0]
            item['url'] = url
            yield item
        if self.offset < 760:
            self.offset += 10
            yield scrapy.Request(url="https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=" + str(self.offset) + "&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M", callback=self.parse, dont_filter=True)
```

##items:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class KesouItem(scrapy.Item):
    # question ID
    question_ID = scrapy.Field()
    # question text
    question = scrapy.Field()
    # time the doctor's answer was posted
    answer_pubtime = scrapy.Field()
    # question details
    description = scrapy.Field()
    # doctor's name
    doctor_name = scrapy.Field()
    # doctor's title
    doctor_title = scrapy.Field()
    # doctor's hospital
    hospital = scrapy.Field()
```

##middlewares:
```python
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class KesouSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class KesouDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```

##pipelines:
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class KesouPipeline(object):
    def __init__(self):
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        sheetname = settings["MONGODB_SHEETNAME"]
        # create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mydb = client[dbname]
        # collection that stores the data
        self.sheet = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.sheet.insert(data)
        return item
```

##settings:
```python
# -*- coding: utf-8 -*-

# Scrapy settings for kesou project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'kesou'

SPIDER_MODULES = ['kesou.spiders']
NEWSPIDER_MODULE = 'kesou.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'kesou (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0"

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'kesou.middlewares.KesouSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'kesou.middlewares.KesouDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'kesou.pipelines.KesouPipeline': 300,
}

# MongoDB host
MONGODB_HOST = "127.0.0.1"
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = "ks"
# collection that stores the data
MONGODB_SHEETNAME = "ks_urls"

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```

##run.py:
```python
# -*- coding: utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl ks".split())
```

##This is the output of the run:
```
PS D:\scrapy_project\kesou> scrapy crawl ks
2019-11-27 00:14:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: kesou)
2019-11-27 00:14:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twis.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryphy 2.6.1, Platform Windows-10-10.0.18362-SP0
2019-11-27 00:14:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'kesou', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['kesou.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67
2019-11-27 00:14:17 [scrapy.extensions.telnet] INFO: Telnet Password: 051629c46f34abdf
2019-11-27 00:14:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled item pipelines: ['kesou.pipelines.KesouPipeline']
2019-11-27 00:14:19 [scrapy.core.engine] INFO: Spider opened
2019-11-27 00:14:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-27 00:14:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-27 00:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX% (referer: None)
2019-11-27 00:14:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFFSYX%2B1M> (referer: None)
Traceback (most recent call last):
  File "d:\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\scrapy_project\kesou\kesou\spiders\ks.py", line 19, in parse
    item['url'] = url
  File "d:\anaconda3\lib\site-packages\scrapy\item.py", line 73, in __setitem__
    (self.__class__.__name__, key))
KeyError: 'KesouItem does not support field: url'
2019-11-27 00:14:20 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 438,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 68368,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.992207,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/KeyError': 1,
 'start_time': datetime.datetime(2019, 11, 26, 16, 14, 19, 863597)}
2019-11-27 00:14:21 [scrapy.core.engine] INFO: Spider closed (finished)
```
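The traceback itself names the immediate problem: the spider assigns item['url'], but KesouItem declares no url field, and self.offset is never given an initial value either. A hedged sketch of those two small changes (the field name url and a starting offset of 0 are my own choices, and the start URL is shortened to a placeholder):

```python
import scrapy


class KesouItem(scrapy.Item):
    # The spider writes item['url'], so the item must declare that field.
    url = scrapy.Field()           # link of one search result
    question_ID = scrapy.Field()   # the remaining fields stay as in the original item


class KsSpider(scrapy.Spider):
    name = 'ks'
    offset = 0  # parse() does "self.offset += 10", so it needs a starting value
    start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0']  # shortened placeholder URL

    def parse(self, response):
        for content in response.xpath('.//h3[@class="t"]'):
            item = KesouItem()  # build a fresh item per result instead of reusing one object
            item['url'] = content.xpath('.//a/@href').extract_first()
            yield item
```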

Python crawler for Douban movies keeps throwing an error, please help

```
==== RESTART: C:\Users\123\AppData\Local\Programs\Python\Python36\类的学习.py ====
Traceback (most recent call last):
  File "C:\Users\123\AppData\Local\Programs\Python\Python36\类的学习.py", line 29, in <module>
    movies_list=get_review(getHtmlText(url))
  File "C:\Users\123\AppData\Local\Programs\Python\Python36\类的学习.py", line 20, in get_review
    dict['name']=tag_li.find('span','titlt')[0].string
TypeError: 'NoneType' object is not subscriptable
```

The code is as follows:

```python
import requests
from bs4 import BeautifulSoup
import bs4

def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def get_review(html):
    movies_list = []
    soup = BeautifulSoup(html, "html.parser")
    soup = soup.find('ol', 'grid_view')
    for tag_li in soup.find_all('li'):
        dict = {}
        dict['rank'] = tag_li.find('em').string
        dict['name'] = tag_li.find('span', 'titlt')[0].string
        dict['score'] = tag_li.find('span', 'rating_num').string
        if (tag_li.find('span', 'inq')):
            dict['desc'] = tag_li.find('span', 'inq').string
        movies_list.append(dict)
    return movies_list

if __name__ == '__main__':
    for i in range(10):
        url = 'http://movie.douban.com/top250?start=%s&filter=&type=' % (i * 25)
        movies_list = get_review(getHtmlText(url))
        for movie_dict in movies_list:
            print('电影排名:' + movie_dict['rank'])
            print('电影名称:' + movie_dict.get('name'))
            print('电影评分:' + movie_dict.get('score'))
            print('电影评词:' + movie_dict.get('desc', '无评词'))
            print('------------------------------------------------------')
```
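Two things stand out in the failing line: the class name looks misspelled ('titlt' rather than 'title', the class used on the Douban Top 250 page), and Tag.find() returns a single tag or None rather than a list, so indexing the result with [0] is what raises "'NoneType' object is not subscriptable" when nothing matches. A hedged sketch of the corrected line plus a guard:

```python
# find() returns one Tag or None; no [0] indexing is needed.
name_tag = tag_li.find('span', 'title')   # assumed class name on the Douban page
dict['name'] = name_tag.string if name_tag else ''
```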

Scrapy scraper problem

<div class="post-text" itemprop="text"> <p>I am trying to use Scrapy to scrape - <a href="http://www.paytm.com" rel="nofollow">www.paytm.com</a> . The website uses AJAX Requests, in the form of XHR to display search results. </p> <p>I managed to track down the XHR, and the AJAX response is SIMILAR to JSON, but it isn't actually JSON. </p> <p>This is the link for one of the XHR request - <a href="https://search.paytm.com/search/?page_count=2&amp;userQuery=tv&amp;items_per_page=30&amp;resolution=960x720&amp;quality=high&amp;q=tv&amp;cat_tree=1&amp;callback=angular.callbacks._6" rel="nofollow">https://search.paytm.com/search/?page_count=2&amp;userQuery=tv&amp;items_per_page=30&amp;resolution=960x720&amp;quality=high&amp;q=tv&amp;cat_tree=1&amp;callback=angular.callbacks._6</a> . If you see the URL correctly, The parameter - <b>page_count</b> - is responsible for showing different pages of results, and the parameter - <b>userQuery</b> - is responsible for the search query that is passed to the website. </p> <p>Now, if you see the response correctly. It isn't actually JSON, only looks similar to JSON ( I veified it on <a href="http://jsonlint.com/" rel="nofollow">http://jsonlint.com/</a> ) . I want to scrape this using SCRAPY <em>( SCRAPY only because since it is a framework, it would be faster than using other libraries like BeautifulSoup, because using them to create a scraper that scrapes at such a high speed would take a lot effort - That is the only reason why I want to use Scrapy. )</em> . </p> <p>Now, This is my snippet of code, that I used to extract the JSON Response from the URL -:</p> <pre><code> jsonresponse = json.loads(response.body_as_unicode()) print json.dumps(jsonresponse, indent=4, sort_keys=True) </code></pre> <p>On executing the code, it throws me an error stating-:</p> <pre><code>2015-07-05 12:13:23 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot) 2015-07-05 12:13:23 [scrapy] INFO: Optional features available: ssl, http11 2015-07-05 12:13:23 [scrapy] INFO: Overridden settings: {'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'CONCURRENT_REQUESTS': 100} 2015-07-05 12:13:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 2015-07-05 12:13:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 2015-07-05 12:13:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 2015-07-05 12:13:23 [scrapy] INFO: Enabled item pipelines: 2015-07-05 12:13:23 [scrapy] INFO: Spider opened 2015-07-05 12:13:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2015-07-05 12:13:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 2015-07-05 12:13:24 [scrapy] DEBUG: Crawled (200) &lt;GET https://search.paytm.com/search/?page_count=2&amp;userQuery=tv&amp;items_per_page=30&amp;resolution=960x720&amp;quality=high&amp;q=tv&amp;cat_tree=1&amp;callback=angular.callbacks._6&gt; (referer: None) 2015-07-05 12:13:24 [scrapy] ERROR: Spider error processing &lt;GET 
https://search.paytm.com/search/?page_count=2&amp;userQuery=tv&amp;items_per_page=30&amp;resolution=960x720&amp;quality=high&amp;q=tv&amp;cat_tree=1&amp;callback=angular.callbacks._6&gt; (referer: None) Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks current.result = callback(current.result, *args, **kw) File "Startup App/SCRAPERS/paytmscraper_scrapy/paytmspiderscript.py", line 111, in parse jsonresponse = json.loads(response.body_as_unicode()) File "/usr/lib/python2.7/json/__init__.py", line 338, in loads return _default_decoder.decode(s) File "/usr/lib/python2.7/json/decoder.py", line 366, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode raise ValueError("No JSON object could be decoded") ValueError: No JSON object could be decoded 2015-07-05 12:13:24 [scrapy] INFO: Closing spider (finished) 2015-07-05 12:13:24 [scrapy] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 343, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 6483, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2015, 7, 5, 6, 43, 24, 733187), 'log_count/DEBUG': 2, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/ValueError': 1, 'start_time': datetime.datetime(2015, 7, 5, 6, 43, 23, 908135)} 2015-07-05 12:13:24 [scrapy] INFO: Spider closed (finished) </code></pre> <p>Now, my Question, How do I scrape such a response using Scrapy? If any other code is required, feel free to ask in the comments. I shall willingly give it! </p> <p>Please provide the entire code related to this. It would be well appreciated! Maybe some manipulation of the JSON Response (from python) (similar to string comparison) would also work for me, if it can help me scrape this! </p> <p>P.S: I can't modify the JSON Response manually (using hand) every time because this is the response that is given by the website. So, please suggest a programmatic (pythonic) way to do this. Preferably, I want to use Scrapy as my framework. </p> </div>
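The response is JSONP: the JSON body is wrapped in a callback such as angular.callbacks._6(...), which is why json.loads() rejects it. A hedged sketch of one common approach, stripping the callback wrapper before parsing (the regex assumes the callback(...) form visible in the URL; it may also be worth trying the same URL without the callback parameter, since many such endpoints then return plain JSON):

```python
import json
import re

# Grab the JSON payload inside the outermost parentheses of a JSONP response,
# e.g. 'angular.callbacks._6({"items": []});' -> '{"items": []}'
JSONP_RE = re.compile(r'^[^(]*\((.*)\)\s*;?\s*$', re.DOTALL)

def jsonp_to_dict(body):
    match = JSONP_RE.match(body)
    payload = match.group(1) if match else body  # fall back to the raw body if it is plain JSON
    return json.loads(payload)

# Inside the spider callback this would be:
#   data = jsonp_to_dict(response.body_as_unicode())
```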

Python scraping hits an "HTTPError: Internal Server Error"

```python
import re
import urllib.request

fh = open('C:\\Users\\Hear-H\\Desktop\\汽车企业数据\\新建文件夹\\298.txt', 'w', encoding='utf-8')
area = '<li><span>公司地区</span>(.*?)</li>'
area1 = area.encode('utf-8')
time = '<span>成立时间</span>(.*?)</li>'
time1 = time.encode('utf-8')
address = '<span>地址</span>(.*?)</li>'
address1 = address.encode('utf-8')
client = '<p id=\"maintypicClient\">(.*?)</p>'
product = '<p id=\"product\">(.*?)</p>'
i = 0
pat = '<a target=\"_blank\" href=\"(http://i.gasgoo.com/supplier/.*?)\">'
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
while i < 100:
    i += 1
    url = "http://i.gasgoo.com/supplier/c-298/index-" + str(i) + ".html"
    web = opener.open(url).read().decode('utf-8')
    rst = re.compile(pat).findall(web)
    rst1 = list()
    for a in rst:
        if a not in rst1:
            rst1.append(a)
    rst1.pop(0)
    for b in rst1:
        pat1 = b + '\">(.*?)</a>'
        name = re.compile(pat1).findall(web)
        name_d = ''.join(name)
        url1 = b
        website1 = opener.open(url1).read().decode('utf-8').encode('utf-8')
        website2 = opener.open(url1).read().decode('utf-8')
        result1 = re.compile(area1).findall(website1)
        for c in result1:
            result1_d = c.decode('utf-8')
        result2 = re.compile(time1).findall(website1)
        for d in result2:
            result2_d = d.decode('utf-8')
        result3 = re.compile(address1).findall(website1)
        for e in result3:
            result3_d = e.decode('utf-8')
        result4 = re.compile(client).findall(str(website2))
        result4_d = ''.join(result4)
        result5 = re.compile(product).findall(str(website2))
        result5_d = ''.join(result5)
        print(name_d + '?' + result1_d + '?' + result2_d + '?' + result3_d + '?' + result4_d + '?' + result5_d + '\n')
        fh1 = fh.write(name_d + '?' + result1_d + '?' + result2_d + '?' + result3_d + '?' + result4_d + '?' + result5_d + '\n')
fh.close
```

While crawling this automotive-supplier data site I keep getting HTTPError: Internal Server Error. When I searched online, people said an Internal Server Error normally comes with a numeric code such as 500, but there is none here. So my question is: in this situation is using a proxy the only option, or is there some other way?

```
Traceback (most recent call last):
  File "<ipython-input-1-7c05d0a2c578>", line 1, in <module>
    runfile('C:/Users/Hear-H/Desktop/汽车企业数据/汽车企业数据挖掘.py', wdir='C:/Users/Hear-H/Desktop/汽车企业数据')
  File "D:\Anaconda\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "D:\Anaconda\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Hear-H/Desktop/汽车企业数据/汽车企业数据挖掘.py", line 39, in <module>
    website1=opener.open(url1).read().decode('utf-8').encode('utf-8')
  File "D:\Anaconda\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "D:\Anaconda\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "D:\Anaconda\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "D:\Anaconda\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "D:\Anaconda\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Internal Server Error
```
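On the question itself: urllib raises HTTPError for any non-2xx status, and the numeric code is available on the exception as e.code (an Internal Server Error is 500), so the status can be inspected and the page retried or skipped before reaching for proxies. A hedged sketch, with an arbitrary retry count and delay:

```python
import time
import urllib.error
import urllib.request

def fetch(opener, url, retries=3, delay=5):
    """Open a URL with the given opener, retrying a few times on 5xx errors."""
    for attempt in range(retries):
        try:
            return opener.open(url, timeout=30).read().decode('utf-8')
        except urllib.error.HTTPError as e:
            print(url, 'returned HTTP', e.code, '-', e.reason)
            if e.code < 500 or attempt == retries - 1:
                raise            # client errors and the final attempt are not retried
            time.sleep(delay)    # back off before retrying a server-side error
```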
