scrapy在settdings.py中已经设置好了DEFAULT_REQUEST_HEADERS,在发起请求的时候应该怎么写headers?

scrapy在settdings.py中已经设置好了DEFAULT_REQUEST_HEADERS,在发起请求的时候应该怎么写headers?

2个回答

图片说明

图片说明

Csdn user default icon
上传中...
上传图片
插入图片
抄袭、复制答案,以达到刷声望分或其他目的的行为,在CSDN问答是严格禁止的,一经发现立刻封号。是时候展现真正的技术了!
其他相关推荐
scrapy配置问题,求大家帮忙啊
配置scrapy 我是按照http://blog.csdn.net/wukaibo1986/article/details/8167590配置的 创建项目可以 但是运行项目的时候报错,做的demo是按照 http://www.oschina.net/translate/scrapy-demo做的 求解释: E:\爬虫\tutorial>scrapy crawl dmoz 2013-11-20 11:09:50+0800 [scrapy] INFO: Scrapy 0.20.0 started (bot: tutorial) 2013-11-20 11:09:50+0800 [scrapy] DEBUG: Optional features available: ssl, http11 2013-11-20 11:09:50+0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'} 2013-11-20 11:09:50+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState Traceback (most recent call last): File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name) File "C:\Python27\lib\runpy.py", line 72, in _run_code exec code in run_globals File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 168, in <module> execute() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 143, in execute _run_print_help(parser, _run_command, cmd, args, opts) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help func(*a, **kw) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command cmd.run(args, opts) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\commands\crawl.py", line 50, in run self.crawler_process.start() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 92, in start if self.start_crawling(): File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 124, in start_crawling return self._start_crawler() is not None File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 139, in _start_crawler crawler.configure() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 47, in configure self.engine = ExecutionEngine(self, self._spider_closed) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\engine.py", line 63, in __init__ self.downloader = Downloader(crawler) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\__init__.py", line 73, in __init__ self.handlers = DownloadHandlers(crawler) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 18, in __init__ cls = load_object(clspath) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\utils\misc.py", line 40, in load_object mod = import_module(module) File "C:\Python27\lib\importlib\__init__.py", line 37, in import_module __import__(name) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\s3.py", line 4, in <module> from .http import HTTPDownloadHandler File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\http.py", line 5, in <module> from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\http11.py", line 17, in <module> from scrapy.responsetypes import responsetypes File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\responsetypes.py", line 113, in <module> responsetypes = ResponseTypes() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\responsetypes.py", line 34, in __init__ self.mimetypes = MimeTypes() File "C:\Python27\lib\mimetypes.py", line 66, in __init__ init() File "C:\Python27\lib\mimetypes.py", line 358, in init db.read_windows_registry() File "C:\Python27\lib\mimetypes.py", line 258, in read_windows_registry for subkeyname in enum_types(hkcr): File "C:\Python27\lib\mimetypes.py", line 249, in enum_types ctype = ctype.encode(default_encoding) # omit in 3.x! UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 9: ordinal not in range(128)
用anaconda的scrapy爬取数据,按照步骤设置好了,却爬不到数据,求助大神救救菜鸟
这是运行的全部结果: (D:\Anaconda2) C:\Users\luyue>cd C:\Users\luyue\movie250 (D:\Anaconda2) C:\Users\luyue\movie250>scrapy crawl movie250 -o items.json 2017-05-12 19:24:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: movie250) 2017-05-12 19:24:26 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'movie250.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['movie250.spiders'], 'BOT_NAME': 'movie250', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'} 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats'] 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-05-12 19:24:26 [scrapy.core.engine] INFO: Spider opened 2017-05-12 19:24:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-05-12 19:24:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-05-12 19:24:26 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://movie.douban.com/robots.txt> (referer: None) 2017-05-12 19:24:26 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://movie.douban.com/top250/> (referer: None) 2017-05-12 19:24:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://movie.douban.com/top250/>: HTTP status code is not handled or not allowed 2017-05-12 19:24:27 [scrapy.core.engine] INFO: Closing spider (finished) 2017-05-12 19:24:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 445, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 496, 'downloader/response_count': 2, 'downloader/response_status_count/403': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 5, 12, 11, 24, 27, 13000), 'log_count/DEBUG': 3, 'log_count/INFO': 8, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2017, 5, 12, 11, 24, 26, 675000)} 2017-05-12 19:24:27 [scrapy.core.engine] INFO: Spider closed (finished)
scrapy运行爬虫时报错Missing scheme in request url
scrapy刚入门小白一枚。用网上的案例代码来玩一玩,案例是http://blog.csdn.net/czl389/article/details/77278166 中的爬取嘻哈歌词。这个案例下有三只爬虫,分别是songurls,lyrics和songinfo。我用songurls爬虫能从虾米音乐上爬取了url并保存在SongUrls.csv中,但是在用lyrics爬虫的时候会报错。信息如下 **D:\xiami2\xiami2>scrapy crawl lyrics -o Lyrics.csv 2017-10-21 21:13:29 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiami2) 2017-10-21 21:13:29 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiami2.spiders', 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 4.0; Trident/3.0)', 'FEED_URI': 'Lyrics.csv', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.2, 'SPIDER_MODULES': ['xiami2.spiders'], 'BOT_NAME': 'xiami2'} 2017-10-21 21:13:29 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats'] 2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled item pipelines: ['xiami2.pipelines.Xiami2Pipeline'] 2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider opened 2017-10-21 21:13:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-10-21 21:13:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-10-21 21:13:31 [scrapy.core.engine] ERROR: Error while obtaining start requests Traceback (most recent call last): File "d:\python3.5\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request request = next(slot.start_requests) File "d:\python3.5\lib\site-packages\scrapy\spiders\__init__.py", line 83, in start_requests yield Request(url, dont_filter=True) File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__ self._set_url(url) File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url raise ValueError('Missing scheme in request url: %s' % self._url) ValueError: Missing scheme in request url: 2017-10-21 21:13:31 [scrapy.core.engine] INFO: Closing spider (finished) 2017-10-21 21:13:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 567323), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'start_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 536236)} 2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider closed (finished) _------------------------------分割线--------------------------------------_ 我去查看了一下_init_.py,发现如下语句。 if ':' not in self._url: raise ValueError('Missing scheme in request url: %s' % self._url) 网上的解决方法看了一些,都没有能解决我的问题的,因此在此讨教,望大家指点一二(真没C币了)。提问次数不多,若有格式方面缺陷还请包含。 另附上代码。 #songurls.py import scrapy import re from scrapy.spiders import CrawlSpider, Rule from ..items import SongUrlItem class SongurlsSpider(scrapy.Spider): name = 'songurls' allowed_domains = ['xiami.com'] #将page/1到page/401,这些链接放进start_urls start_url_list=[] url_fixed='http://www.xiami.com/song/tag/Hip-Hop/page/' #将range范围扩大为1-401,获得所有页面 for i in range(1,402): start_url_list.extend([url_fixed+str(i)]) start_urls=start_url_list def parse(self,response): urls=response.xpath('//*[@id="wrapper"]/div[2]/div/div/div[2]/table/tbody/tr/td[2]/a[1]/@href').extract() for url in urls: song_url=response.urljoin(url) url_item=SongUrlItem() url_item['song_url']=song_url yield url_item ------------------------------分割线-------------------------------------- #lyrics.py import scrapy import re class LyricsSpider(scrapy.Spider): name = 'lyrics' allowed_domains = ['xiami.com'] song_url_file='SongUrls.csv' def __init__(self, *args, **kwargs): #从song_url.csv 文件中读取得到所有歌曲url f = open(self.song_url_file,"r") lines = f.readlines() #这里line[:-1]的含义是每行末尾都是一个换行符,要去掉 #这里in lines[1:]的含义是csv第一行是字段名称,要去掉 song_url_list=[line[:-1] for line in lines[1:]] f.close() while '\n' in song_url_list: song_url_list.remove('\n') self.start_urls = song_url_list#[:100]#删除[:100]之后爬取全部数据 def parse(self,response): lyric_lines=response.xpath('//*[@id="lrc"]/div[1]/text()').extract() lyric='' for lyric_line in lyric_lines: lyric+=lyric_line #print lyric lyricItem=LyricItem() lyricItem['lyric']=lyric lyricItem['song_url']=response.url yield lyricItem songinfo因为还没有用到所以不重要。 ------------------------------分割线-------------------------------------- #items.py import scrapy class SongUrlItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() song_url=scrapy.Field() #歌曲链接 class LyricItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() lyric=scrapy.Field() #歌词 song_url=scrapy.Field() #歌曲链接 class SongInfoItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() song_url=scrapy.Field() #歌曲链接 song_title=scrapy.Field() #歌名 album=scrapy.Field() #专辑 #singer=scrapy.Field() #歌手 language=scrapy.Field() #语种 ------------------------------分割线-------------------------------------- 在middleware下加了几行: sleep_seconds = 0.2 # 模拟点击后休眠3秒,给出浏览器取得响应内容的时间 default_sleep_seconds = 1 # 无动作请求休眠的时间 def process_request(self, request, spider): spider.logger.info('--------Spider request processed: %s' % spider.name) page = None driver = webdriver.PhantomJS() spider.logger.info('--------request.url: %s' % request.url) driver.get(request.url) driver.implicitly_wait(0.2) # 仅休眠数秒加载页面后返回内容 time.sleep(self.sleep_seconds) page = driver.page_source driver.close() return HtmlResponse(request.url, body=page, encoding='utf-8', request=request) ------------------------------分割线-------------------------------------- setting中加了几行也改了几行: from faker import Factory f = Factory.create() USER_AGENT = f.user_agent() DOWNLOAD_DELAY = 0.2 DEFAULT_REQUEST_HEADERS = { 'Host': 'www.xiami.com', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'no-cache', 'Connection': 'Keep-Alive', } ITEM_PIPELINES = { 'xiami2.pipelines.Xiami2Pipeline': 300, }
为什么我用scrapy爬取谷歌应用市场却爬取不到内容?
我想用scrapy爬取谷歌应用市场,代码没有报错,但是却爬取不到内容,这是为什么? ``` # -*- coding: utf-8 -*- import scrapy # from scrapy.spiders import CrawlSpider, Rule # from scrapy.linkextractors import LinkExtractor from gp.items import GpItem # from html.parser import HTMLParser as SGMLParser import requests class GoogleSpider(scrapy.Spider): name = 'google' allowed_domains = ['https://play.google.com/'] start_urls = ['https://play.google.com/store/apps/'] ''' rules = [ Rule(LinkExtractor(allow=("https://play\.google\.com/store/apps/details",)), callback='parse_app', follow=True), ] ''' def parse(self, response): selector = scrapy.Selector(response) urls = selector.xpath('//a[@class="LkLjZd ScJHi U8Ww7d xjAeve nMZKrb id-track-click"]/@href').extract() link_flag = 0 links = [] for link in urls: links.append(link) for each in urls: yield scrapy.Request(links[link_flag], callback=self.parse_next, dont_filter=True) link_flag += 1 def parse_next(self, response): selector = scrapy.Selector(response) app_urls = selector.xpath('//div[@class="details"]/a[@class="title"]/@href').extract() print(app_urls) urls = [] for url in app_urls: url = "http://play.google.com" + url print(url) urls.append(url) link_flag = 0 for each in app_urls: yield scrapy.Request(urls[link_flag], callback=self.parse_app, dont_filter=True) link_flag += 1 def parse_app(self, response): item = GpItem() item['app_url'] = response.url item['app_name'] = response.xpath('//div[@itemprop="name"]').xpath('text()').extract() item['app_icon'] = response.xpath('//img[@itempro="image"]/@src') item['app_developer'] = response.xpath('//') print(response.text) yield item ``` terminal运行信息如下: ``` BettyMacbookPro-764:gp zhanjinyang$ scrapy crawl google 2019-11-12 08:46:45 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: gp) 2019-11-12 08:46:45 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.1 (default, Dec 14 2018, 13:28:58) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.4.2, Platform Darwin-18.5.0-x86_64-i386-64bit 2019-11-12 08:46:45 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'gp', 'NEWSPIDER_MODULE': 'gp.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['gp.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'} 2019-11-12 08:46:45 [scrapy.extensions.telnet] INFO: Telnet Password: b2d7dedf1f4a91eb 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats'] 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled item pipelines: ['gp.pipelines.GpPipeline'] 2019-11-12 08:46:45 [scrapy.core.engine] INFO: Spider opened 2019-11-12 08:46:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-11-12 08:46:45 [py.warnings] WARNING: /anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://play.google.com/ in allowed_domains. warnings.warn(message, URLWarning) 2019-11-12 08:46:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-11-12 08:46:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/robots.txt> (referer: None) 2019-11-12 08:46:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/store/apps/> (referer: None) 2019-11-12 08:46:46 [scrapy.core.engine] INFO: Closing spider (finished) 2019-11-12 08:46:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 810, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 232419, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 12, 8, 46, 46, 474543), 'log_count/DEBUG': 2, 'log_count/INFO': 9, 'log_count/WARNING': 1, 'memusage/max': 58175488, 'memusage/startup': 58175488, 'response_received_count': 2, 'robotstxt/request_count': 1, 'robotstxt/response_count': 1, 'robotstxt/response_status_count/200': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2019, 11, 12, 8, 46, 45, 562775)} 2019-11-12 08:46:46 [scrapy.core.engine] INFO: Spider closed (finished) ``` 求助!!!
请问scrapy为什么会爬取失败
C:\Users\Administrator\Desktop\新建文件夹\xiaozhu>python -m scrapy crawl xiaozhu 2019-10-26 11:43:11 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: xiaozhu) 2019-10-26 11:43:11 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9 .5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.5.3 (v 3.5.3:1880cb95a742, Jan 16 2017, 15:51:26) [MSC v.1900 32 bit (Intel)], pyOpenSS L 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-7-6.1 .7601-SP1 2019-10-26 11:43:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'xi aozhu', 'SPIDER_MODULES': ['xiaozhu.spiders'], 'NEWSPIDER_MODULE': 'xiaozhu.spid ers'} 2019-10-26 11:43:11 [scrapy.extensions.telnet] INFO: Telnet Password: c61bda45d6 3b8138 2019-10-26 11:43:11 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats'] 2019-10-26 11:43:12 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-10-26 11:43:12 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-10-26 11:43:12 [scrapy.middleware] INFO: Enabled item pipelines: [] 2019-10-26 11:43:12 [scrapy.core.engine] INFO: Spider opened 2019-10-26 11:43:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pag es/min), scraped 0 items (at 0 items/min) 2019-10-26 11:43:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-10-26 11:43:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting ( 307) to <GET https://bizverify.xiaozhu.com?slideRedirect=https%3A%2F%2Fbj.xiaozh u.com%2Ffangzi%2F125535477903.html> from <GET http://bj.xiaozhu.com/fangzi/12553 5477903.html> 2019-10-26 11:43:12 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://bizve rify.xiaozhu.com?slideRedirect=https%3A%2F%2Fbj.xiaozhu.com%2Ffangzi%2F125535477 903.html> (referer: None) 2019-10-26 11:43:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://bizverify.xiaozhu.com?slideRedirect=https%3A%2F%2Fbj.xiaozhu.com%2 Ffangzi%2F125535477903.html>: HTTP status code is not handled or not allowed 2019-10-26 11:43:12 [scrapy.core.engine] INFO: Closing spider (finished) 2019-10-26 11:43:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 529, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 725, 'downloader/response_count': 2, 'downloader/response_status_count/307': 1, 'downloader/response_status_count/400': 1, 'elapsed_time_seconds': 0.427734, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 10, 26, 3, 43, 12, 889648), 'httperror/response_ignored_count': 1, 'httperror/response_ignored_status_count/400': 1, 'log_count/DEBUG': 2, 'log_count/INFO': 11, 'response_received_count': 1, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2019, 10, 26, 3, 43, 12, 461914)} 2019-10-26 11:43:12 [scrapy.core.engine] INFO: Spider closed (finished)
一个百度拇指医生爬虫,想要先实现爬取某个问题的所有链接,但是爬不出来东西。求各位大神帮忙看一下这是为什么?
#写在前面的话 在这个爬虫里我想实现把百度拇指医生里关于“咳嗽”的链接全部爬取下来,下一步要进行的是把爬取到的每个链接里的items里面的内容爬取下来,但是我在第一步就卡住了,求各位大神帮我看一下吧。之前刚刚发了一篇问答,但是不知道怎么回事儿,现在找不到了,(貌似是被删了...?)救救小白吧!感激不尽! 这个是我的爬虫的结构 ![图片说明](https://img-ask.csdn.net/upload/201911/27/1574787999_274479.png) ##ks: ``` # -*- coding: utf-8 -*- import scrapy from kesou.items import KesouItem from scrapy.selector import Selector from scrapy.spiders import Spider from scrapy.http import Request ,FormRequest import pymongo class KsSpider(scrapy.Spider): name = 'ks' allowed_domains = ['kesou,baidu.com'] start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M'] def parse(self, response): item = KesouItem() contents = response.xpath('.//h3[@class="t"]') for content in contents: url = content.xpath('.//a/@href').extract()[0] item['url'] = url yield item if self.offset < 760: self.offset += 10 yield scrapy.Request(url = "https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=" + str(self.offset) + "&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M",callback=self.parse,dont_filter=True) ``` ##items: ``` # -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # https://docs.scrapy.org/en/latest/topics/items.html import scrapy class KesouItem(scrapy.Item): # 问题ID question_ID = scrapy.Field() # 问题描述 question = scrapy.Field() # 医生回答发表时间 answer_pubtime = scrapy.Field() # 问题详情 description = scrapy.Field() # 医生姓名 doctor_name = scrapy.Field() # 医生职位 doctor_title = scrapy.Field() # 医生所在医院 hospital = scrapy.Field() ``` ##middlewares: ``` # -*- coding: utf-8 -*- # Define here the models for your spider middleware # # See documentation in: # https://docs.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals class KesouSpiderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the spider middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_spider_input(self, response, spider): # Called for each response that goes through the spider # middleware and into the spider. # Should return None or raise an exception. return None def process_spider_output(self, response, result, spider): # Called with the results returned from the Spider, after # it has processed the response. # Must return an iterable of Request, dict or Item objects. for i in result: yield i def process_spider_exception(self, response, exception, spider): # Called when a spider or process_spider_input() method # (from other spider middleware) raises an exception. # Should return either None or an iterable of Request, dict # or Item objects. pass def process_start_requests(self, start_requests, spider): # Called with the start requests of the spider, and works # similarly to the process_spider_output() method, except # that it doesn’t have a response associated. # Must return only requests (not items). for r in start_requests: yield r def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) class KesouDownloaderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_request(self, request, spider): # Called for each request that goes through the downloader # middleware. # Must either: # - return None: continue processing this request # - or return a Response object # - or return a Request object # - or raise IgnoreRequest: process_exception() methods of # installed downloader middleware will be called return None def process_response(self, request, response, spider): # Called with the response returned from the downloader. # Must either; # - return a Response object # - return a Request object # - or raise IgnoreRequest return response def process_exception(self, request, exception, spider): # Called when a download handler or a process_request() # (from other downloader middleware) raises an exception. # Must either: # - return None: continue processing this exception # - return a Response object: stops process_exception() chain # - return a Request object: stops process_exception() chain pass def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) ``` ##piplines: ``` # -*- coding: utf-8 -*- # Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html import pymongo from scrapy.utils.project import get_project_settings settings = get_project_settings() class KesouPipeline(object): def __init__(self): host = settings["MONGODB_HOST"] port = settings["MONGODB_PORT"] dbname = settings["MONGODB_DBNAME"] sheetname= settings["MONGODB_SHEETNAME"] # 创建MONGODB数据库链接 client = pymongo.MongoClient(host = host, port = port) # 指定数据库 mydb = client[dbname] # 存放数据的数据库表名 self.sheet = mydb[sheetname] def process_item(self, item, spider): data = dict(item) self.sheet.insert(data) return item ``` ##settings: ``` # -*- coding: utf-8 -*- # Scrapy settings for kesou project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # https://docs.scrapy.org/en/latest/topics/settings.html # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html # https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'kesou' SPIDER_MODULES = ['kesou.spiders'] NEWSPIDER_MODULE = 'kesou.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'kesou (+http://www.yourdomain.com)' # Obey robots.txt rules ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16) #CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0) # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs #DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of: #CONCURRENT_REQUESTS_PER_DOMAIN = 16 #CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default) COOKIES_ENABLED = False # Disable Telnet Console (enabled by default) #TELNETCONSOLE_ENABLED = False USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0" # Override the default request headers: #DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', #} # Enable or disable spider middlewares # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html #SPIDER_MIDDLEWARES = { # 'kesou.middlewares.KesouSpiderMiddleware': 543, #} # Enable or disable downloader middlewares # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html #DOWNLOADER_MIDDLEWARES = { # 'kesou.middlewares.KesouDownloaderMiddleware': 543, #} # Enable or disable extensions # See https://docs.scrapy.org/en/latest/topics/extensions.html #EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, #} # Configure item pipelines # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html ITEM_PIPELINES = { 'kesou.pipelines.KesouPipeline': 300, } # MONGODB 主机名 MONGODB_HOST = "127.0.0.1" # MONGODB 端口号 MONGODB_PORT = 27017 # 数据库名称 MONGODB_DBNAME = "ks" # 存放数据的表名称 MONGODB_SHEETNAME = "ks_urls" # Enable and configure the AutoThrottle extension (disabled by default) # See https://docs.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHROTTLE_ENABLED = True # The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings #HTTPCACHE_ENABLED = True #HTTPCACHE_EXPIRATION_SECS = 0 #HTTPCACHE_DIR = 'httpcache' #HTTPCACHE_IGNORE_HTTP_CODES = [] #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' ``` ##run.py: ``` # -*- coding: utf-8 -*- from scrapy import cmdline cmdline.execute("scrapy crawl ks".split()) ``` ##这个是运行出来的结果: ``` PS D:\scrapy_project\kesou> scrapy crawl ks 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: kesou) 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twis.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryphy 2.6.1, Platform Windows-10-10.0.18362-SP0 2019-11-27 00:14:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'kesou', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['kesou.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67 2019-11-27 00:14:17 [scrapy.extensions.telnet] INFO: Telnet Password: 051629c46f34abdf 2019-11-27 00:14:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled item pipelines: ['kesou.pipelines.KesouPipeline'] 2019-11-27 00:14:19 [scrapy.core.engine] INFO: Spider opened 2019-11-27 00:14:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-11-27 00:14:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-11-27 00:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX% (referer: None) 2019-11-27 00:14:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFFSYX%2B1M> (referer: None) Traceback (most recent call last): File "d:\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output for x in result: File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr> return (_set_referer(r) for r in result or ()) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> return (r for r in result or () if _filter(r)) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> return (r for r in result or () if _filter(r)) File "D:\scrapy_project\kesou\kesou\spiders\ks.py", line 19, in parse item['url'] = url File "d:\anaconda3\lib\site-packages\scrapy\item.py", line 73, in __setitem__ (self.__class__.__name__, key)) KeyError: 'KesouItem does not support field: url' 2019-11-27 00:14:20 [scrapy.core.engine] INFO: Closing spider (finished) 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 10, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/KeyError': 1, 'start_time': datetime.datetime(2019, 11, 26, 16, 14, 19, 863597)} 2019-11-27 00:14:21 [scrapy.core.engine] INFO: Spider closed (finished) ```
scrapy 运行抛出NotImplementedError,请问一般什么原因造成呢?
/usr/bin/python3.5 /home/pzs/PycharmProjects/News/main.py 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: News) 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'News', 'SPIDER_MODULES': ['News.spiders'], 'NEWSPIDER_MODULE': 'News.spiders'} 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole',  'scrapy.extensions.corestats.CoreStats',  'scrapy.extensions.logstats.LogStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',  'scrapy.downloadermiddlewares.retry.RetryMiddleware',  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',  'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',  'scrapy.spidermiddlewares.referer.RefererMiddleware',  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',  'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled item pipelines: ['News.pipelines.MysqlPipeline'] 2017-04-08 11:00:12 [scrapy.core.engine] INFO: Spider opened 2017-04-08 11:00:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-04-08 11:00:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-04-08 11:00:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://18.92.0.1/contents/7/121174.html> (referer: None) 2017-04-08 11:00:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://18.92.0.1/contents/7/121174.html> (referer: None) Traceback (most recent call last):   File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks     current.result = callback(current.result, *args, **kw)   File "/usr/local/lib/python3.5/dist-packages/scrapy/spiders/__init__.py", line 76, in parse     raise NotImplementedError NotImplementedError 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Closing spider (finished) 2017-04-08 11:00:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 229,  'downloader/request_count': 1,  'downloader/request_method_count/GET': 1,  'downloader/response_bytes': 16609,  'downloader/response_count': 1,  'downloader/response_status_count/200': 1,  'finish_reason': 'finished',  'finish_time': datetime.datetime(2017, 4, 8, 18, 0, 13, 938637),  'log_count/DEBUG': 2,  'log_count/ERROR': 1,  'log_count/INFO': 7,  'response_received_count': 1,  'scheduler/dequeued': 1,  'scheduler/dequeued/memory': 1,  'scheduler/enqueued': 1,  'scheduler/enqueued/memory': 1,  'spider_exceptions/NotImplementedError': 1,  'start_time': datetime.datetime(2017, 4, 8, 18, 0, 12, 917719)} 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Spider closed (finished) Process finished with exit code 0 直接运行会弹出NotImplementedError错误,单步调试也看不出到底哪里出了问题
scrapy出错twisted.python.failure.Failure twisted.internet.error
我是跟这网上视频写的 ``` import scrapy class QsbkSpider(scrapy.Spider): name = 'qsbk' allowed_domains = ['www.qiushibaike.com/'] start_urls = ['https://www.qiushibaike.com/text/'] def parse(self, response): print('='*10) print(response) print('*'*10) ``` 出现 ![图片说明](https://img-ask.csdn.net/upload/201908/06/1565066344_519008.png) 请求头改了还是不行,用requests库爬取又可以
python Scrapy创建项目出错,执行scrapy startproject test 出错;
Traceback (most recent call last): File "d:\users\july_whj\lib\runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "d:\users\july_whj\lib\runpy.py", line 72, in _run_code exec code in run_globals File "D:\Users\July_whj\Scripts\scrapy.exe\__main__.py", line 5, in <module> File "d:\users\july_whj\lib\site-packages\scrapy\cmdline.py", line 9, in <module> from scrapy.crawler import CrawlerProcess File "d:\users\july_whj\lib\site-packages\scrapy\crawler.py", line 7, in <module> from twisted.internet import reactor, defer File "d:\users\july_whj\lib\site-packages\twisted\internet\reactor.py", line 38, in <module> from twisted.internet import default File "d:\users\july_whj\lib\site-packages\twisted\internet\default.py", line 56, in <module> install = _getInstallFunction(platform) File "d:\users\july_whj\lib\site-packages\twisted\internet\default.py", line 50, in _getInstallFunction from twisted.internet.selectreactor import install File "d:\users\july_whj\lib\site-packages\twisted\internet\selectreactor.py", line 18, in <module> from twisted.internet import posixbase File "d:\users\july_whj\lib\site-packages\twisted\internet\posixbase.py", line 18, in <module> from twisted.internet import error, udp, tcp File "d:\users\july_whj\lib\site-packages\twisted\internet\tcp.py", line 28, in <module> from twisted.internet._newtls import ( File "d:\users\july_whj\lib\site-packages\twisted\internet\_newtls.py", line 21, in <module> from twisted.protocols.tls import TLSMemoryBIOFactory, TLSMemoryBIOProtocol File "d:\users\july_whj\lib\site-packages\twisted\protocols\tls.py", line 63, in <module> from twisted.internet._sslverify import _setAcceptableProtocols File "d:\users\july_whj\lib\site-packages\twisted\internet\_sslverify.py", line 38, in <module> TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1, AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1' ![图片说明](https://img-ask.csdn.net/upload/201703/14/1489472404_160732.png)
非常简单的scrapy代码但就是不清楚到底哪里出问题了,高手帮忙看看吧!
News_spider文件 # -*- coding: utf-8 -*- import scrapy import re from scrapy import Selector from News.items import NewsItem class NewsSpiderSpider(scrapy.Spider): name = "news_spider" allowed_domains = ["http://18.92.0.1"] start_urls = ['http://18.92.0.1/contents/7/121174.html'] def parse_detail(self, response): sel = Selector(response) items = [] item = NewsItem() item['title'] = sel.css('.div_bt::text').extract()[0] characters = sel.css('.div_zz::text').extract()[0].replace("\xa0","") pattern = re.compile('[:].*[ ]') result = pattern.search(characters) item['post'] = result.group().replace(":","").strip() pattern = re.compile('[ ][^发]*') result = pattern.search(characters) item['approver'] = result.group() pattern = re.compile('[201].{9}') result = pattern.search(characters) item['date_of_publication'] = result.group() pattern = re.compile('([0-9]+)$') result = pattern.search(characters) item['browse_times'] = result.group() content = sel.css('.xwnr').extract()[0] pattern = re.compile('[\u4e00-\u9fa5]|[,、。“”]') result = pattern.findall(content) item['content'] = ''.join(result).replace("仿宋"," ").replace("宋体"," ").replace("楷体"," ") item['img1_url'] = sel.xpath('//*[@id="newpic"]/div[1]/div[1]/img/@src').extract()[0] item['img1_name'] = sel.xpath('//*[@id="newpic"]/div[1]/div[2]/text()').extract()[0] item['img2_url'] = sel.xpath('//*[@id="newpic"]/div[2]/div[1]/img/@src').extract()[0] item['img2_name'] = sel.xpath('//*[@id="newpic"]/div[2]/div[2]').extract()[0] item['img3_url'] = sel.xpath('//*[@id="newpic"]/div[3]/div[1]/img/@src').extract()[0] item['img3_name'] = sel.xpath('//*[@id="newpic"]/div[3]/div[2]/text()').extract()[0] item['img4_url'] = sel.xpath('//*[@id="newpic"]/div[4]/div[1]/img/@src').extract()[0] item['img4_name'] = sel.xpath('//*[@id="newpic"]/div[4]/div[2]/text()').extract()[0] item['img5_url'] = sel.xpath('//*[@id="newpic"]/div[5]/div[1]/img/@src').extract()[0] item['img5_name'] = sel.xpath('//*[@id="newpic"]/div[5]/div[2]/text()').extract()[0] item['img6_url'] = sel.xpath('//*[@id="newpic"]/div[6]/div[1]/img/@src').extract()[0] item['img6_name'] = sel.xpath('//*[@id="newpic"]/div[6]/div[2]/text()').extract()[0] characters = sel.xpath('/html/body/div/div[2]/div[4]/div[4]/text()').extract()[0].replace("\xa0","") pattern = re.compile('[:].*?[ ]') result = pattern.search(characters) item['company'] = result.group().replace(":", "").strip() pattern = re.compile('[ ][^联]*') result = pattern.search(characters) item['writer_photography'] = result.group() pattern = re.compile('(([0-9]|[-])+)$') result = pattern.search(characters) item['tel'] = result.group() items.append(item) items文件 return items import scrapy class NewsItem(scrapy.Item): title = scrapy.Field() post = scrapy.Field() approver = scrapy.Field() date_of_publication = scrapy.Field() browse_times = scrapy.Field() content = scrapy.Field() img1_url = scrapy.Field() img1_name = scrapy.Field() img2_url = scrapy.Field() img2_name = scrapy.Field() img3_url = scrapy.Field() img3_name = scrapy.Field() img4_url = scrapy.Field() img4_name = scrapy.Field() img5_url = scrapy.Field() img5_name = scrapy.Field() img6_url = scrapy.Field() img6_name = scrapy.Field() company = scrapy.Field() writer_photography = scrapy.Field() tel = scrapy.Field() pipelines文件 import MySQLdb import MySQLdb.cursors class NewsPipeline(object): def process_item(self, item, spider): return item class MysqlPipeline(object): def __init__(self): self.conn = MySQLdb.connect('192.168.254.129','root','root','news',charset="utf8",use_unicode=True) self.cursor = self.conn.cursor() def process_item(self, item, spider): insert_sql = "insert into news_table(title,post,approver,date_of_publication,browse_times,content,img1_url,img1_name,img2_url,img2_name,img3_url,img3_name,img4_url,img4_name,img5_url,img5_name,img6_url,img6_name,company,writer_photography,tel)VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)" self.cursor.execute(insert_sql,(item['title'],item['post'],item['approver'],item['date_of_publication'],item['browse_times'],item['content'],item['img1_url'],item['img1_name'],item['img1_url'],item['img1_name'],item['img2_url'],item['img2_name'],item['img3_url'],item['img3_name'],item['img4_url'],item['img4_name'],item['img5_url'],item['img5_name'],item['img6_url'],item['img6_name'],item['company'],item['writer_photography'],item['tel'])) self.conn.commit() setting文件 BOT_NAME = 'News' SPIDER_MODULES = ['News.spiders'] NEWSPIDER_MODULE = 'News.spiders' ROBOTSTXT_OBEY = False COOKIES_ENABLED = True ITEM_PIPELINES = { #'News.pipelines.NewsPipeline': 300, 'News.pipelines.MysqlPipeline': 300, } /usr/bin/python3.5 /home/pzs/PycharmProjects/News/main.py 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: News) 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'News', 'SPIDER_MODULES': ['News.spiders'], 'NEWSPIDER_MODULE': 'News.spiders'} 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled item pipelines: ['News.pipelines.MysqlPipeline'] 2017-04-08 11:00:12 [scrapy.core.engine] INFO: Spider opened 2017-04-08 11:00:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-04-08 11:00:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-04-08 11:00:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://18.92.0.1/contents/7/121174.html> (referer: None) 2017-04-08 11:00:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://18.92.0.1/contents/7/121174.html> (referer: None) Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/usr/local/lib/python3.5/dist-packages/scrapy/spiders/__init__.py", line 76, in parse raise NotImplementedError NotImplementedError 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Closing spider (finished) 2017-04-08 11:00:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 229, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 16609, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 4, 8, 18, 0, 13, 938637), 'log_count/DEBUG': 2, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/NotImplementedError': 1, 'start_time': datetime.datetime(2017, 4, 8, 18, 0, 12, 917719)} 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Spider closed (finished) Process finished with exit code 0 直接运行会弹出NotImplementedError错误,单步调试也看不出到底哪里出了问题
scrapy在多线程模式下,为每个线程设置独立的代理ip,并在后续请求不变,如何做到?
遇到的一个问题,需求是scrapy在middlewares.py中想为每个线程设置独立的ip 但是对方网站追踪cookie和ip地址为后续请求做验证,顾需要在第一次为每个线程设置完代理后,便不再改变使之持续如何做到,我现在可以做到的是为每个请求分配不同ip或者为所有请求分配同一个ip,没办法做到位每个线程分配不同ip并使之持续不变
为什么我直接用requests爬网页可以,但用scrapy不行?
``` class job51(): def __init__(self): self.headers={ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'max-age=0', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'Cookie':'' } def start(self): html=session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php",headers=self.headers) self.parse(html) def parse(self,response): tree=lxml.etree.HTML(response.text) resume_url=tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href') print (resume_url[0] ``` 能爬到我想要的结果,就是简历的url,但是用scrapy,同样的headers,页面好像停留在登录页面? ``` class job51(Spider): name = "job51" #allowed_domains = ["my.51job.com"] start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"] headers={ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'max-age=0', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'Cookie':'' } def start_requests(self): yield Request(url=self.start_urls[0],headers=self.headers,callback=self.parse) def parse(self,response): #tree=lxml.etree.HTML(text) selector=Selector(response) print ("<<<<<<<<<<<<<<<<<<<<<",response.text) resume_url=selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href') print (">>>>>>>>>>>>",resume_url) ``` 输出的结果: scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'} 2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened 2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None) 2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None) <<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script> >>>>>>>>>>>> [] 2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None) Traceback (most recent call last): File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output for x in result: File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr> return (_set_referer(r) for r in result or ()) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> return (r for r in result or () if _filter(r)) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> return (r for r in result or () if _filter(r)) File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse yield Request(resume_url[0],headers=self.headers,callback=self.getResume) File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__ o = super(SelectorList, self).__getitem__(pos) IndexError: list index out of range 2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished) 2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 628, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 5743, 'downloader/response_count': 2, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634), 'log_count/DEBUG': 3, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/IndexError': 1, 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)} 2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
scrapy-redis报错,这个真不知道什么原因,我之前写的另外一个爬虫是可以执行的
Traceback (most recent call last): File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) GeneratorExit Exception ignored in: <generator object iter_errback at 0x0000029D30248CA8> RuntimeError: generator ignored GeneratorExit Unhandled error in Deferred: 2018-07-17 17:50:00 [twisted] CRITICAL: Unhandled error in Deferred: 2018-07-17 17:50:00 [twisted] CRITICAL: Traceback (most recent call last): File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\twisted\internet\task.py", line 517, in _oneWorkUnit result = next(self._iterator) File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy\utils\defer.py", line 63, in <genexpr> work = (callable(elem, *args, **named) for elem in iterable) File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy\core\scraper.py", line 183, in _process_spidermw_output self.crawler.engine.crawl(request=output, spider=spider) File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy\core\engine.py", line 210, in crawl self.schedule(request, spider) File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy\core\engine.py", line 216, in schedule if not self.slot.scheduler.enqueue_request(request): File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy_redis\scheduler.py", line 167, in enqueue_request self.queue.push(request) File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy_redis\queue.py", line 99, in push data = self._encode_request(request) File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy_redis\queue.py", line 43, in _encode_request return self.serializer.dumps(obj) File "C:\Users\xin\Desktop\spider_thief\venv\lib\site-packages\scrapy_redis\picklecompat.py", line 14, in dumps return pickle.dumps(obj, protocol=-1) RecursionError: maximum recursion depth exceeded while calling a Python object
scrapy1.4.0版本保存数据为JSON格式的疑问
spider的代码 ``` from textsc.items import TextscItem from scrapy.selector import Selector from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor class Baispider(CrawlSpider): name = "Baidu" allowed_domains = ["baidu.com"] start_urls = [ "https://zhidao.baidu.com/list" ] rules = ( Rule(LinkExtractor(allow=('/shop', ), deny=('fr', )), callback='parse_item'), ) def parse_item(self, response): sel= Selector(response) items=[] item=TextscItem() title=sel.xpath('//div[@class="shop-menu"]/ul/li/a/text()').extract() for i in title: items.append(i) item['TitleName'] = items print (item['TitleName']) return item ``` item.py的代码 ``` import scrapy import json class TextscItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() TitleName = scrapy.Field() pass ``` 运行时输入了 ``` scrapy crawl Bai -o items.json -t json ``` 运行时很正常 没报错 但是运行后点击查看了 items.json文件 什么都没有 求解决方法 谢过.
有没有懂python scrapy代理ip的老哥?
一个困扰我好几天的问题:用scrapy写的一个访问58同城的简易爬虫,在中间件里爬了很多有效的代理IP,但是在process____request方法里,代理IP不知道为什么就是不切换,一直使用的是最初成功的那个IP,明明打印的信息是已经更换了新的IP,实际访问的结果来看却还是没有更换。。。 -----这是控制台的打印: ![图片说明](https://img-ask.csdn.net/upload/201909/27/1569590380_292339.png) 这是爬虫文件:xicispider.py name = 'xicispider' allowed_domains = ['58.com'] start_urls = ['https://www.58.com/'] def parse(self, response): reg = r'<title>(.*?)</title>' print(re.search(reg,response.text).group()) yield scrapy.Request(url='https://www.58.com',callback=self.parsep, dont_filter=True) def parsep(self, response): reg = r'<title>(.*?)</title>' print(re.search(reg,response.text).group()) 这是中间件:middleware.py def process_request(self,spider,request): ip = random.choice(self.proxies) print("process_request方法运行了,重新获取的ip是:--------->",ip) request.meta['proxy'] = ip 这是settings.py里的有关配置: DOWNLOADER_MIDDLEWARES = { 'xici.middlewares.XiciDM': 543, }
scrapy爬取知乎首页乱码
爬取知乎首页,返回的response.text是乱码,尝试解码response.body,得到的还是乱码,不知道为什么,代码如下: ``` import scrapy HEADERS = { 'Host': 'www.zhihu.com', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.9', 'Cache-Control': 'no-cache', 'Connection': 'keep-alive', 'Origin': 'https://www.zhihu.com', 'Referer': 'https://www.zhihu.com/', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36' } class ZhihuSpider(scrapy.Spider): name = 'zhihu' allowed_domains = ['www.zhihu.com'] start_urls = ['https://www.zhihu.com/'] def start_requests(self): for url in self.start_urls: yield scrapy.Request(url, headers=HEADERS) def parse(self, response): print('========== parse ==========') print(response.text[:100]) body = response.body encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1'] for encoding in encodings: try: print('========== decode ' + encoding) print(body.decode(encoding)[:100]) print('========== decode end\n') except Exception as e: print('########## decode {0}, error: {1}\n'.format(encoding, e)) pass ``` 输出的log如下: D:\workspace_python\ZhihuSpider>scrapy crawl zhihu 2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider) 2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']} 2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats'] 2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened 2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: https://www.zhihu.com/) ========== parse ========== ��~!���#5���=B���_��^��ˆ� ═4�� 1���J�╗%Xi��/{�vH�"�� z�I�zLgü^�1� Q)Ա�_k}�䄍���/T����U�3���l��� ========== decode utf-8 ########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte ========== decode gbk ########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence ========== decode gb2312 ########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence ========== decode iso-8859-1 áø~!¢ 同样的代码,如果将爬取的网站换成douban,就一点问题都没有,百度找遍了都没找到办法,只能来这里提问了,请各位大神帮帮忙,如果爬虫搞不定,我仿的知乎后台就没数据展示了,真的很着急,。剩下不到5C币,没法悬赏,但真的需要大神的帮助。
Python爬虫,用scrapy框架和scrapy-splash爬豆瓣读书设置代理不起作用,有没有大神帮忙看一下,谢谢
用scrapy框架和scrapy-splash爬豆瓣读书设置代理不起作用,代理设置后还是提示需要登录。 settings内的FirstSplash.middlewares.FirstsplashSpiderMiddleware':823和FirstsplashSpiderMiddleware里面的 request.meta['splash']['args']['proxy'] = "'http://112.87.69.226:9999"是从网上搜的,代理ip是从【西刺免费代理IP】这个网站随便找的一个,scrapy crawl Doubanbook打印出来的text 内容是需要登录。有没有大神帮忙看看,感谢!运行结果: ![图片说明](https://img-ask.csdn.net/upload/201904/25/1556181491_319319.jpg) <br>spider代码: ``` name = 'doubanBook' category = '' def start_requests(self): serachBook = ['python','scala','spark'] for x in serachBook: self.category = x start_urls = ['https://book.douban.com/subject_search', ] url=start_urls[0]+"?search_text="+x self.log("开始爬取:"+url) yield SplashRequest(url,self.parse_pre) def parse_pre(self, response): print(response.text) ``` 中间件代理配置: ``` class FirstsplashSpiderMiddleware(object): def process_request(self, request, spider): print("进入代理") print(request.meta['splash']['args']['proxy']) request.meta['splash']['args']['proxy'] = "'http://112.87.69.226:9999" print(request.meta['splash']['args']['proxy']) ``` settings配置: ``` BOT_NAME = 'FirstSplash' SPIDER_MODULES = ['FirstSplash.spiders'] NEWSPIDER_MODULE = 'FirstSplash.spiders' ROBOTSTXT_OBEY = False #docker+scrapy-splash配置 FEED_EXPORT_ENCODING='utf-8' #doucer服务地址 SPLASH_URL = 'http://127.0.0.1:8050' # 去重过滤器 DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' # 使用Splash的Http缓存 HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage' #此处配置改为splash自带配置 SPIDER_MIDDLEWARES = { 'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, } #下载器中间件改为splash自带配置 DOWNLOADER_MIDDLEWARES = { 'scrapy_splash.SplashCookiesMiddleware': 723, 'scrapy_splash.SplashMiddleware': 725, 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 'FirstSplash.middlewares.FirstsplashSpiderMiddleware':823, } # 模拟浏览器请求头 DEFAULT_REQUEST_HEADERS = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', } ```
在Linux上跑scrapy项目 [scrapy.core.engine] DEBUG
![图片说明](https://img-ask.csdn.net/upload/201906/12/1560308593_973487.png) 困扰我很久了,想爬入下来之后存入本地数据库,可是这个问题一直困扰我,ROBOTSTXT_OBEY = False,useragent也设置了,不知道问题是什么,网上试过了很多方法了,希望大佬 ``` ```
python2有easy_install 但是却无法使用是怎么回事?
按照该连接操作:https://blog.csdn.net/qq_40678222/article/details/82734870 ``` hellopython@ubuntu:~$ easy_install --version setuptools 41.6.0 from /home/hellopython/.local/lib/python2.7/site-packages (Python 2.7) hellopython@ubuntu:~$ sudo easy_install virtualenvwrapper sudo: easy_install: command not found hellopython@ubuntu:~$ cd /home/hellopython/.local/lib/python2.7/site-packages hellopython@ubuntu:~/.local/lib/python2.7/site-packages$ ls decorator-4.4.1.dist-info retry decorator.py retry-0.9.2.dist-info decorator.pyc scrapy easy_install.py Scrapy-1.8.0.dist-info easy_install.pyc selenium ``` 请问这是为什么?python2里有easy_install.py ,但是却无法使用,是因为默认用python3解释器吗?怎么临时指定python2来执行呢?(不要长期)
爬虫福利二 之 妹子图网MM批量下载
爬虫福利一:27报网MM批量下载    点击 看了本文,相信大家对爬虫一定会产生强烈的兴趣,激励自己去学习爬虫,在这里提前祝:大家学有所成! 目标网站:妹子图网 环境:Python3.x 相关第三方模块:requests、beautifulsoup4 Re:各位在测试时只需要将代码里的变量 path 指定为你当前系统要保存的路径,使用 python xxx.py 或IDE运行即可。
Java学习的正确打开方式
在博主认为,对于入门级学习java的最佳学习方法莫过于视频+博客+书籍+总结,前三者博主将淋漓尽致地挥毫于这篇博客文章中,至于总结在于个人,实际上越到后面你会发现学习的最好方式就是阅读参考官方文档其次就是国内的书籍,博客次之,这又是一个层次了,这里暂时不提后面再谈。博主将为各位入门java保驾护航,各位只管冲鸭!!!上天是公平的,只要不辜负时间,时间自然不会辜负你。 何谓学习?博主所理解的学习,它
程序员必须掌握的核心算法有哪些?
由于我之前一直强调数据结构以及算法学习的重要性,所以就有一些读者经常问我,数据结构与算法应该要学习到哪个程度呢?,说实话,这个问题我不知道要怎么回答你,主要取决于你想学习到哪些程度,不过针对这个问题,我稍微总结一下我学过的算法知识点,以及我觉得值得学习的算法。这些算法与数据结构的学习大多数是零散的,并没有一本把他们全部覆盖的书籍。下面是我觉得值得学习的一些算法以及数据结构,当然,我也会整理一些看过
大学四年自学走来,这些私藏的实用工具/学习网站我贡献出来了
大学四年,看课本是不可能一直看课本的了,对于学习,特别是自学,善于搜索网上的一些资源来辅助,还是非常有必要的,下面我就把这几年私藏的各种资源,网站贡献出来给你们。主要有:电子书搜索、实用工具、在线视频学习网站、非视频学习网站、软件下载、面试/求职必备网站。 注意:文中提到的所有资源,文末我都给你整理好了,你们只管拿去,如果觉得不错,转发、分享就是最大的支持了。 一、PDF搜索网站推荐 对于大部
linux系列之常用运维命令整理笔录
本博客记录工作中需要的linux运维命令,大学时候开始接触linux,会一些基本操作,可是都没有整理起来,加上是做开发,不做运维,有些命令忘记了,所以现在整理成博客,当然vi,文件操作等就不介绍了,慢慢积累一些其它拓展的命令,博客不定时更新 顺便拉下票,我在参加csdn博客之星竞选,欢迎投票支持,每个QQ或者微信每天都可以投5票,扫二维码即可,http://m234140.nofollow.ax.
比特币原理详解
一、什么是比特币 比特币是一种电子货币,是一种基于密码学的货币,在2008年11月1日由中本聪发表比特币白皮书,文中提出了一种去中心化的电子记账系统,我们平时的电子现金是银行来记账,因为银行的背后是国家信用。去中心化电子记账系统是参与者共同记账。比特币可以防止主权危机、信用风险。其好处不多做赘述,这一层面介绍的文章很多,本文主要从更深层的技术原理角度进行介绍。 二、问题引入  假设现有4个人
程序员接私活怎样防止做完了不给钱?
首先跟大家说明一点,我们做 IT 类的外包开发,是非标品开发,所以很有可能在开发过程中会有这样那样的需求修改,而这种需求修改很容易造成扯皮,进而影响到费用支付,甚至出现做完了项目收不到钱的情况。 那么,怎么保证自己的薪酬安全呢? 我们在开工前,一定要做好一些证据方面的准备(也就是“讨薪”的理论依据),这其中最重要的就是需求文档和验收标准。一定要让需求方提供这两个文档资料作为开发的基础。之后开发
网页实现一个简单的音乐播放器(大佬别看。(⊙﹏⊙))
今天闲着无事,就想写点东西。然后听了下歌,就打算写个播放器。 于是乎用h5 audio的加上js简单的播放器完工了。 欢迎 改进 留言。 演示地点跳到演示地点 html代码如下`&lt;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;title&gt;music&lt;/title&gt; &lt;meta charset="utf-8"&gt
Python十大装B语法
Python 是一种代表简单思想的语言,其语法相对简单,很容易上手。不过,如果就此小视 Python 语法的精妙和深邃,那就大错特错了。本文精心筛选了最能展现 Python 语法之精妙的十个知识点,并附上详细的实例代码。如能在实战中融会贯通、灵活使用,必将使代码更为精炼、高效,同时也会极大提升代码B格,使之看上去更老练,读起来更优雅。 1. for - else 什么?不是 if 和 else 才
数据库优化 - SQL优化
前面一篇文章从实例的角度进行数据库优化,通过配置一些参数让数据库性能达到最优。但是一些“不好”的SQL也会导致数据库查询变慢,影响业务流程。本文从SQL角度进行数据库优化,提升SQL运行效率。 判断问题SQL 判断SQL是否有问题时可以通过两个表象进行判断: 系统级别表象 CPU消耗严重 IO等待严重 页面响应时间过长
2019年11月中国大陆编程语言排行榜
2019年11月2日,我统计了某招聘网站,获得有效程序员招聘数据9万条。针对招聘信息,提取编程语言关键字,并统计如下: 编程语言比例 rank pl_ percentage 1 java 33.62% 2 c/c++ 16.42% 3 c_sharp 12.82% 4 javascript 12.31% 5 python 7.93% 6 go 7.25% 7
通俗易懂地给女朋友讲:线程池的内部原理
餐厅的约会 餐盘在灯光的照耀下格外晶莹洁白,女朋友拿起红酒杯轻轻地抿了一小口,对我说:“经常听你说线程池,到底线程池到底是个什么原理?”我楞了一下,心里想女朋友今天是怎么了,怎么突然问出这么专业的问题,但做为一个专业人士在女朋友面前也不能露怯啊,想了一下便说:“我先给你讲讲我前同事老王的故事吧!” 大龄程序员老王 老王是一个已经北漂十多年的程序员,岁数大了,加班加不动了,升迁也无望,于是拿着手里
经典算法(5)杨辉三角
写在前面: 我是 扬帆向海,这个昵称来源于我的名字以及女朋友的名字。我热爱技术、热爱开源、热爱编程。技术是开源的、知识是共享的。 这博客是对自己学习的一点点总结及记录,如果您对 Java、算法 感兴趣,可以关注我的动态,我们一起学习。 用知识改变命运,让我们的家人过上更好的生活。 目录一、杨辉三角的介绍二、杨辉三角的算法思想三、代码实现1.第一种写法2.第二种写法 一、杨辉三角的介绍 百度
腾讯算法面试题:64匹马8个跑道需要多少轮才能选出最快的四匹?
昨天,有网友私信我,说去阿里面试,彻底的被打击到了。问了为什么网上大量使用ThreadLocal的源码都会加上private static?他被难住了,因为他从来都没有考虑过这个问题。无独有偶,今天笔者又发现有网友吐槽了一道腾讯的面试题,我们一起来看看。 腾讯算法面试题:64匹马8个跑道需要多少轮才能选出最快的四匹? 在互联网职场论坛,一名程序员发帖求助到。二面腾讯,其中一个算法题:64匹
面试官:你连RESTful都不知道我怎么敢要你?
面试官:了解RESTful吗? 我:听说过。 面试官:那什么是RESTful? 我:就是用起来很规范,挺好的 面试官:是RESTful挺好的,还是自我感觉挺好的 我:都挺好的。 面试官:… 把门关上。 我:… 要干嘛?先关上再说。 面试官:我说出去把门关上。 我:what ?,夺门而去 文章目录01 前言02 RESTful的来源03 RESTful6大原则1. C-S架构2. 无状态3.统一的接
为啥国人偏爱Mybatis,而老外喜欢Hibernate/JPA呢?
关于SQL和ORM的争论,永远都不会终止,我也一直在思考这个问题。昨天又跟群里的小伙伴进行了一番讨论,感触还是有一些,于是就有了今天这篇文。 声明:本文不会下关于Mybatis和JPA两个持久层框架哪个更好这样的结论。只是摆事实,讲道理,所以,请各位看官勿喷。 一、事件起因 关于Mybatis和JPA孰优孰劣的问题,争论已经很多年了。一直也没有结论,毕竟每个人的喜好和习惯是大不相同的。我也看
SQL-小白最佳入门sql查询一
一 说明 如果是初学者,建议去网上寻找安装Mysql的文章安装,以及使用navicat连接数据库,以后的示例基本是使用mysql数据库管理系统; 二 准备前提 需要建立一张学生表,列分别是id,名称,年龄,学生信息;本示例中文章篇幅原因SQL注释略; 建表语句: CREATE TABLE `student` ( `id` int(11) NOT NULL AUTO_INCREMENT, `
项目中的if else太多了,该怎么重构?
介绍 最近跟着公司的大佬开发了一款IM系统,类似QQ和微信哈,就是聊天软件。我们有一部分业务逻辑是这样的 if (msgType = "文本") { // dosomething } else if(msgType = "图片") { // doshomething } else if(msgType = "视频") { // doshomething } else { // dosho
【图解经典算法题】如何用一行代码解决约瑟夫环问题
约瑟夫环问题算是很经典的题了,估计大家都听说过,然后我就在一次笔试中遇到了,下面我就用 3 种方法来详细讲解一下这道题,最后一种方法学了之后保证让你可以让你装逼。 问题描述:编号为 1-N 的 N 个士兵围坐在一起形成一个圆圈,从编号为 1 的士兵开始依次报数(1,2,3…这样依次报),数到 m 的 士兵会被杀死出列,之后的士兵再从 1 开始报数。直到最后剩下一士兵,求这个士兵的编号。 1、方
致 Python 初学者
文章目录1. 前言2. 明确学习目标,不急于求成,不好高骛远3. 在开始学习 Python 之前,你需要做一些准备2.1 Python 的各种发行版2.2 安装 Python2.3 选择一款趁手的开发工具3. 习惯使用IDLE,这是学习python最好的方式4. 严格遵从编码规范5. 代码的运行、调试5. 模块管理5.1 同时安装了py2/py35.2 使用Anaconda,或者通过IDE来安装模
“狗屁不通文章生成器”登顶GitHub热榜,分分钟写出万字形式主义大作
一、垃圾文字生成器介绍 最近在浏览GitHub的时候,发现了这样一个骨骼清奇的雷人项目,而且热度还特别高。 项目中文名:狗屁不通文章生成器 项目英文名:BullshitGenerator 根据作者的介绍,他是偶尔需要一些中文文字用于GUI开发时测试文本渲染,因此开发了这个废话生成器。但由于生成的废话实在是太过富于哲理,所以最近已经被小伙伴们给玩坏了。 他的文风可能是这样的: 你发现,
程序员:我终于知道post和get的区别
IT界知名的程序员曾说:对于那些月薪三万以下,自称IT工程师的码农们,其实我们从来没有把他们归为我们IT工程师的队伍。他们虽然总是以IT工程师自居,但只是他们一厢情愿罢了。 此话一出,不知激起了多少(码农)程序员的愤怒,却又无可奈何,于是码农问程序员。 码农:你知道get和post请求到底有什么区别? 程序员:你看这篇就知道了。 码农:你月薪三万了? 程序员:嗯。 码农:你是怎么做到的? 程序员:
《程序人生》系列-这个程序员只用了20行代码就拿了冠军
你知道的越多,你不知道的越多 点赞再看,养成习惯GitHub上已经开源https://github.com/JavaFamily,有一线大厂面试点脑图,欢迎Star和完善 前言 这一期不算《吊打面试官》系列的,所有没前言我直接开始。 絮叨 本来应该是没有这期的,看过我上期的小伙伴应该是知道的嘛,双十一比较忙嘛,要值班又要去帮忙拍摄年会的视频素材,还得搞个程序员一天的Vlog,还要写BU
加快推动区块链技术和产业创新发展,2019可信区块链峰会在京召开
      11月8日,由中国信息通信研究院、中国通信标准化协会、中国互联网协会、可信区块链推进计划联合主办,科技行者协办的2019可信区块链峰会将在北京悠唐皇冠假日酒店开幕。   区块链技术被认为是继蒸汽机、电力、互联网之后,下一代颠覆性的核心技术。如果说蒸汽机释放了人类的生产力,电力解决了人类基本的生活需求,互联网彻底改变了信息传递的方式,区块链作为构造信任的技术有重要的价值。   1
程序员把地府后台管理系统做出来了,还有3.0版本!12月7号最新消息:已在开发中有github地址
第一幕:缘起 听说阎王爷要做个生死簿后台管理系统,我们派去了一个程序员…… 996程序员做的梦: 第一场:团队招募 为了应对地府管理危机,阎王打算找“人”开发一套地府后台管理系统,于是就在地府总经办群中发了项目需求。 话说还是中国电信的信号好,地府都是满格,哈哈!!! 经常会有外行朋友问:看某网站做的不错,功能也简单,你帮忙做一下? 而这次,面对这样的需求,这个程序员
网易云6亿用户音乐推荐算法
网易云音乐是音乐爱好者的集聚地,云音乐推荐系统致力于通过 AI 算法的落地,实现用户千人千面的个性化推荐,为用户带来不一样的听歌体验。 本次分享重点介绍 AI 算法在音乐推荐中的应用实践,以及在算法落地过程中遇到的挑战和解决方案。 将从如下两个部分展开: AI 算法在音乐推荐中的应用 音乐场景下的 AI 思考 从 2013 年 4 月正式上线至今,网易云音乐平台持续提供着:乐屏社区、UGC
【技巧总结】位运算装逼指南
位算法的效率有多快我就不说,不信你可以去用 10 亿个数据模拟一下,今天给大家讲一讲位运算的一些经典例子。不过,最重要的不是看懂了这些例子就好,而是要在以后多去运用位运算这些技巧,当然,采用位运算,也是可以装逼的,不信,你往下看。我会从最简单的讲起,一道比一道难度递增,不过居然是讲技巧,那么也不会太难,相信你分分钟看懂。 判断奇偶数 判断一个数是基于还是偶数,相信很多人都做过,一般的做法的代码如下
日均350000亿接入量,腾讯TubeMQ性能超过Kafka
整理 | 夕颜出品 | AI科技大本营(ID:rgznai100) 【导读】近日,腾讯开源动作不断,相继开源了分布式消息中间件TubeMQ,基于最主流的 OpenJDK8开发的
8年经验面试官详解 Java 面试秘诀
    作者 | 胡书敏 责编 | 刘静 出品 | CSDN(ID:CSDNnews) 本人目前在一家知名外企担任架构师,而且最近八年来,在多家外企和互联网公司担任Java技术面试官,前后累计面试了有两三百位候选人。在本文里,就将结合本人的面试经验,针对Java初学者、Java初级开发和Java开发,给出若干准备简历和准备面试的建议。   Java程序员准备和投递简历的实
面试官如何考察你的思维方式?
1.两种思维方式在求职面试中,经常会考察这种问题:北京有多少量特斯拉汽车? 某胡同口的煎饼摊一年能卖出多少个煎饼? 深圳有多少个产品经理? 一辆公交车里能装下多少个乒乓球? 一
相关热词 c#选择结构应用基本算法 c# 收到udp包后回包 c#oracle 头文件 c# 序列化对象 自定义 c# tcp 心跳 c# ice连接服务端 c# md5 解密 c# 文字导航控件 c#注册dll文件 c#安装.net
立即提问