Python scraping: why does requests return data, but Scrapy returns an empty response?

"http://detail.zol.com.cn/index.php?c=SearchList&keyword=%C8%FD%D0%C7&page=1"
With requests I can fetch the data; with Scrapy the status code is 200, but the response contains no data. What could be the reason?

2 answers

Post your Scrapy code so we can take a look.
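For reference, a minimal sketch of a spider that mirrors the working requests call. Common reasons for a 200 response with an empty body are a server-side user-agent check, robots.txt being obeyed, or content injected by JavaScript; the settings and headers below are assumptions for comparison, since the asker's Scrapy code isn't posted:

```python
import scrapy

class ZolSearchSpider(scrapy.Spider):
    # Hypothetical spider name/settings; adjust to your own project.
    name = "zol_search"
    # The keyword in the URL is GBK percent-encoded, so the page itself is
    # likely GBK-encoded as well; Scrapy detects that from the response headers.
    start_urls = [
        "http://detail.zol.com.cn/index.php?c=SearchList&keyword=%C8%FD%D0%C7&page=1"
    ]
    custom_settings = {
        "ROBOTSTXT_OBEY": False,   # rule out robots.txt filtering while debugging
        "USER_AGENT": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36"),
    }

    def parse(self, response):
        # Log the start of the body to confirm whether any HTML came back at all.
        self.logger.info("body starts with: %r", response.text[:200])
```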

Other related questions
Why does fetching the page directly with requests work, but not with Scrapy?
``` class job51(): def __init__(self): self.headers={ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'max-age=0', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'Cookie':'' } def start(self): html=session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php",headers=self.headers) self.parse(html) def parse(self,response): tree=lxml.etree.HTML(response.text) resume_url=tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href') print (resume_url[0] ``` 能爬到我想要的结果,就是简历的url,但是用scrapy,同样的headers,页面好像停留在登录页面? ``` class job51(Spider): name = "job51" #allowed_domains = ["my.51job.com"] start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"] headers={ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'max-age=0', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'Cookie':'' } def start_requests(self): yield Request(url=self.start_urls[0],headers=self.headers,callback=self.parse) def parse(self,response): #tree=lxml.etree.HTML(text) selector=Selector(response) print ("<<<<<<<<<<<<<<<<<<<<<",response.text) resume_url=selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href') print (">>>>>>>>>>>>",resume_url) ``` 输出的结果: scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'} 2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened 2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None) 2017-04-11 10:58:33 [scrapy.core.engine] 
DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None) <<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script> >>>>>>>>>>>> [] 2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None) Traceback (most recent call last): File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output for x in result: File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr> return (_set_referer(r) for r in result or ()) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> return (r for r in result or () if _filter(r)) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> return (r for r in result or () if _filter(r)) File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse yield Request(resume_url[0],headers=self.headers,callback=self.getResume) File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__ o = super(SelectorList, self).__getitem__(pos) IndexError: list index out of range 2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished) 2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 628, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 5743, 'downloader/response_count': 2, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634), 'log_count/DEBUG': 3, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/IndexError': 1, 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)} 2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
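The 200 response printed above is just a JavaScript redirect to the login page, which means the request was not recognised as a logged-in session (the Cookie header in the spider shown is empty). Scrapy's CookiesMiddleware manages cookies itself, so a cookie string pasted into `headers` is easy to get wrong; passing the cookies through the `cookies` argument is more reliable. A minimal sketch (the cookie string is a placeholder):

```python
import scrapy

class Job51Spider(scrapy.Spider):
    name = "job51"
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    # Placeholder: paste the cookie string copied from a logged-in browser session.
    raw_cookie = "key1=value1; key2=value2"

    def start_requests(self):
        # Convert the "k=v; k=v" string into the dict form Scrapy expects.
        cookies = dict(
            pair.strip().split("=", 1)
            for pair in self.raw_cookie.split(";") if "=" in pair
        )
        yield scrapy.Request(self.start_urls[0], cookies=cookies, callback=self.parse)

    def parse(self, response):
        # With valid session cookies, the resume link should now be present.
        resume_url = response.xpath('//tr[@class="resumeName"]/td[1]/a/@href').extract_first()
        self.logger.info("resume url: %s", resume_url)
```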
A question about using proxies with a Scrapy spider
I'm using a Scrapy spider to collect data and bought some proxies. The Scrapy docs describe writing a middleware to use a proxy, but I'd rather not write one. I tried this:
```
def start_requests(self):
    name = my_name
    password = password
    proxy = my proxy
    return [
        FormRequest(url, formate={'account': my_name, 'password': password},
                    meta={'proxy': proxy},
                    callback=self.after_login)
    ]

def after_login(self, response):
    response.xpath
```
but it raised an error. How can I use a proxy without a middleware? Thanks.
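Setting `request.meta['proxy']` is indeed enough; Scrapy's built-in HttpProxyMiddleware picks it up, no custom middleware required. The error in the snippet above is more likely the misspelled `formate=` keyword (it should be `formdata=`) and the bare placeholder values. A hedged sketch with a made-up URL and credentials:

```python
from scrapy import Spider, FormRequest

class LoginSpider(Spider):
    name = "login_with_proxy"

    def start_requests(self):
        # All values below are placeholders.
        login_url = "https://example.com/login"
        proxy = "http://user:password@123.45.67.89:8080"
        yield FormRequest(
            login_url,
            formdata={"account": "my_name", "password": "my_password"},
            meta={"proxy": proxy},   # the built-in HttpProxyMiddleware reads this key
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("logged in, status %s", response.status)
```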
Python Scrapy image spider: beginner asking for help
求问大神 我这个data她怎么了 报错: 2020-02-07 09:24:55 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: meizitu) 2020-02-07 09:24:55 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.17763-SP0 2020-02-07 09:24:55 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'meizitu', 'NEWSPIDER_MODULE': 'meizitu.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['meizitu.spiders']} 2020-02-07 09:24:55 [scrapy.extensions.telnet] INFO: Telnet Password: 0936097982b9bcc8 2020-02-07 09:24:55 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats'] 2020-02-07 09:24:56 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2020-02-07 09:24:56 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] Unhandled error in Deferred: 2020-02-07 09:24:56 [twisted] CRITICAL: Unhandled error in Deferred: Traceback (most recent call last): File "e:\python3.7\lib\site-packages\scrapy\crawler.py", line 184, in crawl return self._crawl(crawler, *args, **kwargs) File "e:\python3.7\lib\site-packages\scrapy\crawler.py", line 188, in _crawl d = crawler.crawl(*args, **kwargs) File "e:\python3.7\lib\site-packages\twisted\internet\defer.py", line 1613, in unwindGenerator return _cancellableInlineCallbacks(gen) File "e:\python3.7\lib\site-packages\twisted\internet\defer.py", line 1529, in _cancellableInlineCallbacks _inlineCallbacks(None, g, status) --- <exception caught here> --- File "e:\python3.7\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks result = g.send(result) File "e:\python3.7\lib\site-packages\scrapy\crawler.py", line 86, in crawl self.engine = self._create_engine() File "e:\python3.7\lib\site-packages\scrapy\crawler.py", line 111, in _create_engine return ExecutionEngine(self, lambda _: self.stop()) File "e:\python3.7\lib\site-packages\scrapy\core\engine.py", line 70, in __init__ self.scraper = Scraper(crawler) File "e:\python3.7\lib\site-packages\scrapy\core\scraper.py", line 71, in __init__ self.itemproc = itemproc_cls.from_crawler(crawler) File "e:\python3.7\lib\site-packages\scrapy\middleware.py", line 53, in from_crawler return cls.from_settings(crawler.settings, crawler) File 
"e:\python3.7\lib\site-packages\scrapy\middleware.py", line 34, in from_settings mwcls = load_object(clspath) File "e:\python3.7\lib\site-packages\scrapy\utils\misc.py", line 46, in load_object mod = import_module(module) File "e:\python3.7\lib\importlib\__init__.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 1006, in _gcd_import File "<frozen importlib._bootstrap>", line 983, in _find_and_load File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 677, in _load_unlocked File "<frozen importlib._bootstrap_external>", line 724, in exec_module File "<frozen importlib._bootstrap_external>", line 860, in get_code File "<frozen importlib._bootstrap_external>", line 791, in source_to_code File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed builtins.SyntaxError: unexpected EOF while parsing (pipelines.py, line 22) 2020-02-07 09:24:56 [twisted] CRITICAL: Traceback (most recent call last): File "e:\python3.7\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks result = g.send(result) File "e:\python3.7\lib\site-packages\scrapy\crawler.py", line 86, in crawl self.engine = self._create_engine() File "e:\python3.7\lib\site-packages\scrapy\crawler.py", line 111, in _create_engine return ExecutionEngine(self, lambda _: self.stop()) File "e:\python3.7\lib\site-packages\scrapy\core\engine.py", line 70, in __init__ self.scraper = Scraper(crawler) File "e:\python3.7\lib\site-packages\scrapy\core\scraper.py", line 71, in __init__ self.itemproc = itemproc_cls.from_crawler(crawler) File "e:\python3.7\lib\site-packages\scrapy\middleware.py", line 53, in from_crawler return cls.from_settings(crawler.settings, crawler) File "e:\python3.7\lib\site-packages\scrapy\middleware.py", line 34, in from_settings mwcls = load_object(clspath) File "e:\python3.7\lib\site-packages\scrapy\utils\misc.py", line 46, in load_object mod = import_module(module) File "e:\python3.7\lib\importlib\__init__.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 1006, in _gcd_import File "<frozen importlib._bootstrap>", line 983, in _find_and_load File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 677, in _load_unlocked File "<frozen importlib._bootstrap_external>", line 724, in exec_module File "<frozen importlib._bootstrap_external>", line 860, in get_code File "<frozen importlib._bootstrap_external>", line 791, in source_to_code File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed File "E:\python_work\爬虫\meizitu\meizitu\pipelines.py", line 22 f.write(data) ^ SyntaxError: unexpected EOF while parsing 代码如下: pipeline ``` import requests class MeizituPipeline(object): def process_item(self, item, spider): print("main_title:",item['main_title']) print("main_image:", item['main_image']) print("main_tags:", item['main_tags']) print("main_meta:", item['main_meta']) print("page:", item['main_pagenavi']) url = requests.get(item['main_image']) print(url) try: with open(item['main_pagenavi'] +'.jpg','wb') as f: data = url.read() f.write(data) ``` image.py ``` import scrapy from scrapy.http import response from ..items import MeizituItem class ImageSpider(scrapy.Spider): #定义Spider的名字scrapy crawl meiaitu name = 'SpiderMain' #允许爬虫的域名 allowed_domains = ['www.mzitu.com/203554'] #爬取的首页列表 
start_urls = ['https://www.mzitu.com/203554'] #负责提取response的信息 #response代表下载器从start_urls中的url的到的回应 #提取的信息 def parse(self,response): #遍历所有节点 for Main in response.xpath('//div[@class = "main"]'): item = MeizituItem() #匹配所有节点元素/html/body/div[2]/div[1]/div[3]/p/a content = Main.xpath('//div[@class = "content"]') item['main_title'] = content.xpath('./h2/text()') item['main_image'] = content.xpath('./div[@class="main-image"]/p/a/img') item['main_meta'] = content.xpath('./div[@class="main-meta"]/span/text()').extract() item['main_tags'] = content.xpath('./div[@class="main-tags"]/a/text()').extract() item['main_pagenavi'] = content.xpath('./div[@class="main_pagenavi"]/span/text()').extract_first() yield item new_links = response.xpath('.//div[@class="pagenavi"]/a/@href').extract() new_link =new_links[-1] yield scrapy.Request(new_link,callback=self.parse) ``` setting ``` BOT_NAME = 'meizitu' SPIDER_MODULES = ['meizitu.spiders'] NEWSPIDER_MODULE = 'meizitu.spiders' ROBOTSTXT_OBEY = True #配置默认请求头 DEFAULT_REQUEST_HEADERS = { "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36", 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' } ITEM_PIPELINES = { 'meizitu.pipelines.MeizituPipeline':300, } IMAGES_STORE = 'E:\python_work\爬虫\meizitu' IMAGES_MIN_HEIGHT = 1050 IMAGES_MIN_WIDTH = 700 ```
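The traceback above is not a crawling problem at all: `pipelines.py` line 22 has a `try:` block with no matching `except`/`finally`, so Python raises `SyntaxError: unexpected EOF while parsing` before the spider even starts. Note also that a requests response has no `.read()` method; the raw bytes are in `.content`. A hedged sketch of a corrected pipeline (field names follow the item shown above):

```python
import requests

class MeizituPipeline(object):
    def process_item(self, item, spider):
        # Assumes main_image holds a plain URL string (extract it in the spider first).
        resp = requests.get(item['main_image'])
        filename = str(item['main_pagenavi']) + '.jpg'
        try:
            with open(filename, 'wb') as f:
                f.write(resp.content)   # .content, not .read(), for a requests response
        except OSError as e:
            spider.logger.error("failed to save %s: %s", filename, e)
        return item
```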
Scrapy reports "Missing scheme in request url" when running the spider
scrapy刚入门小白一枚。用网上的案例代码来玩一玩,案例是http://blog.csdn.net/czl389/article/details/77278166 中的爬取嘻哈歌词。这个案例下有三只爬虫,分别是songurls,lyrics和songinfo。我用songurls爬虫能从虾米音乐上爬取了url并保存在SongUrls.csv中,但是在用lyrics爬虫的时候会报错。信息如下 **D:\xiami2\xiami2>scrapy crawl lyrics -o Lyrics.csv 2017-10-21 21:13:29 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiami2) 2017-10-21 21:13:29 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiami2.spiders', 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 4.0; Trident/3.0)', 'FEED_URI': 'Lyrics.csv', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.2, 'SPIDER_MODULES': ['xiami2.spiders'], 'BOT_NAME': 'xiami2'} 2017-10-21 21:13:29 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats'] 2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled item pipelines: ['xiami2.pipelines.Xiami2Pipeline'] 2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider opened 2017-10-21 21:13:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-10-21 21:13:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-10-21 21:13:31 [scrapy.core.engine] ERROR: Error while obtaining start requests Traceback (most recent call last): File "d:\python3.5\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request request = next(slot.start_requests) File "d:\python3.5\lib\site-packages\scrapy\spiders\__init__.py", line 83, in start_requests yield Request(url, dont_filter=True) File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__ self._set_url(url) File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url raise ValueError('Missing scheme in request url: %s' % self._url) ValueError: Missing scheme in request url: 2017-10-21 21:13:31 [scrapy.core.engine] INFO: Closing spider (finished) 2017-10-21 21:13:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 567323), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'start_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 536236)} 2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider closed (finished) 
_------------------------------分割线--------------------------------------_ 我去查看了一下_init_.py,发现如下语句。 if ':' not in self._url: raise ValueError('Missing scheme in request url: %s' % self._url) 网上的解决方法看了一些,都没有能解决我的问题的,因此在此讨教,望大家指点一二(真没C币了)。提问次数不多,若有格式方面缺陷还请包含。 另附上代码。 #songurls.py import scrapy import re from scrapy.spiders import CrawlSpider, Rule from ..items import SongUrlItem class SongurlsSpider(scrapy.Spider): name = 'songurls' allowed_domains = ['xiami.com'] #将page/1到page/401,这些链接放进start_urls start_url_list=[] url_fixed='http://www.xiami.com/song/tag/Hip-Hop/page/' #将range范围扩大为1-401,获得所有页面 for i in range(1,402): start_url_list.extend([url_fixed+str(i)]) start_urls=start_url_list def parse(self,response): urls=response.xpath('//*[@id="wrapper"]/div[2]/div/div/div[2]/table/tbody/tr/td[2]/a[1]/@href').extract() for url in urls: song_url=response.urljoin(url) url_item=SongUrlItem() url_item['song_url']=song_url yield url_item ------------------------------分割线-------------------------------------- #lyrics.py import scrapy import re class LyricsSpider(scrapy.Spider): name = 'lyrics' allowed_domains = ['xiami.com'] song_url_file='SongUrls.csv' def __init__(self, *args, **kwargs): #从song_url.csv 文件中读取得到所有歌曲url f = open(self.song_url_file,"r") lines = f.readlines() #这里line[:-1]的含义是每行末尾都是一个换行符,要去掉 #这里in lines[1:]的含义是csv第一行是字段名称,要去掉 song_url_list=[line[:-1] for line in lines[1:]] f.close() while '\n' in song_url_list: song_url_list.remove('\n') self.start_urls = song_url_list#[:100]#删除[:100]之后爬取全部数据 def parse(self,response): lyric_lines=response.xpath('//*[@id="lrc"]/div[1]/text()').extract() lyric='' for lyric_line in lyric_lines: lyric+=lyric_line #print lyric lyricItem=LyricItem() lyricItem['lyric']=lyric lyricItem['song_url']=response.url yield lyricItem songinfo因为还没有用到所以不重要。 ------------------------------分割线-------------------------------------- #items.py import scrapy class SongUrlItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() song_url=scrapy.Field() #歌曲链接 class LyricItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() lyric=scrapy.Field() #歌词 song_url=scrapy.Field() #歌曲链接 class SongInfoItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() song_url=scrapy.Field() #歌曲链接 song_title=scrapy.Field() #歌名 album=scrapy.Field() #专辑 #singer=scrapy.Field() #歌手 language=scrapy.Field() #语种 ------------------------------分割线-------------------------------------- 在middleware下加了几行: sleep_seconds = 0.2 # 模拟点击后休眠3秒,给出浏览器取得响应内容的时间 default_sleep_seconds = 1 # 无动作请求休眠的时间 def process_request(self, request, spider): spider.logger.info('--------Spider request processed: %s' % spider.name) page = None driver = webdriver.PhantomJS() spider.logger.info('--------request.url: %s' % request.url) driver.get(request.url) driver.implicitly_wait(0.2) # 仅休眠数秒加载页面后返回内容 time.sleep(self.sleep_seconds) page = driver.page_source driver.close() return HtmlResponse(request.url, body=page, encoding='utf-8', request=request) ------------------------------分割线-------------------------------------- setting中加了几行也改了几行: from faker import Factory f = Factory.create() USER_AGENT = f.user_agent() DOWNLOAD_DELAY = 0.2 DEFAULT_REQUEST_HEADERS = { 'Host': 'www.xiami.com', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'no-cache', 'Connection': 'Keep-Alive', } ITEM_PIPELINES = { 'xiami2.pipelines.Xiami2Pipeline': 300, }
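`Missing scheme in request url:` followed by nothing means an empty string ended up in `start_urls` — typically a blank or truncated line read from `SongUrls.csv`. Filtering the lines and checking the scheme before assigning them is usually enough. A hedged sketch of the `__init__` shown above:

```python
import scrapy

class LyricsSpider(scrapy.Spider):
    name = 'lyrics'
    allowed_domains = ['xiami.com']
    song_url_file = 'SongUrls.csv'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        with open(self.song_url_file, 'r') as f:
            lines = f.readlines()
        # Skip the CSV header, strip whitespace/newlines, and drop anything
        # that is empty or lacks an http(s) scheme.
        self.start_urls = [
            line.strip() for line in lines[1:]
            if line.strip().startswith(('http://', 'https://'))
        ]
```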
Scrapy crawl problem: Filtered duplicate request
When requesting the site http://bigfile.co.kr with Scrapy, it prints "Filtered duplicate request: no more duplicates" and then finishes. After adding dont_filter=True and rerunning, it loops forever, never finishes, and still scrapes nothing. Could someone take a look?
```python
    name = 'WebSpider'
    start_urls = ['http://bigfile.co.kr']
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        'Referer': 'http://www.baidu.com/',
        "Upgrade-Insecure-Requests": 1,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    }

    def start_requests(self):
        request = scrapy.Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)
        request.meta['url'] = self.start_urls[0]
        yield request
```
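One common pattern that produces exactly this pair of symptoms is a redirect (or meta refresh) chain leading back to an already-seen URL: with the dupe filter on, the follow-up request is dropped and the crawl ends; with `dont_filter=True` the chain never ends. Before changing the spider, it helps to make the filtering and redirects visible and cap them. A hedged settings sketch:

```python
# settings.py (sketch) - make the behaviour observable before changing the spider
DUPEFILTER_DEBUG = True      # log every request dropped as a duplicate
REDIRECT_MAX_TIMES = 5       # stop endless redirect chains instead of looping
# If the site redirects based on cookies or user agent, keeping the browser-like
# headers from the spider above plus cookies enabled is worth testing too.
COOKIES_ENABLED = True
```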
Scrapy spider problem
class LiepinspiderSpider(scrapy.Spider): name = 'liepinspider' allowed_domains = ['www.liepin.com'] start_urls = ['http://www.liepin.com/'] #要抓取的最大页数 max_page = 20 # dqs = "010" 是北京 20是上海 050020是广州 070020是杭州 170020是武汉 out_url = 'https://www.liepin.com/zhaopin/?ckid=2d4732e20cffdcd9&fromSearchBtn=2&init=-1&flushckid=1&dqs={dqs}&flushckid=1&key={key}&imscid=R000000058&headckid=2d4732e20cffdcd9&d_pageSize=40&siTag=I-7rQ0e90mv8a37po7dV3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_headId=6857362076fa97aa53548faedae4487b&d_ckId=6857362076fa97aa53548faedae4487b&d_sfrom=search_fp_bar&d_curPage=0' next_url_base = 'https://www.liepin.com/zhaopin/?ckid=f2d1b254babfd641&fromSearchBtn=2&init=-1&flushckid=1&dqs={dqs}&degradeFlag=0&key={key}&imscid=R000000058&headckid=2d4732e20cffdcd9&d_pageSize=40&siTag=I-7rQ0e90mv8a37po7dV3Q%7EF5FSJAXvyHmQyODXqGxdVw&d_headId=6857362076fa97aa53548faedae4487b&d_ckId=0818b2c9abf73593582b6a90e780804f&d_sfrom=search_fp_bar&d_curPage=0&curPage={page}' def start_requests(self): yield Request(url=self.out_url.format(key="python", dqs="010", ), callback=self.out_html_parse) def out_html_parse(self,response): #解析外部网页,生成各个工作的工作链接 if response.status ==200: job_urls = response.xpath('//div[@class="sojob-result "]/ul//a/@href').extract() for url in job_urls: if re.search((r"(.*shtml)"),url): #有的网页内容不是工作链接,用正则删去 if re.search((r"(.*shtml)"),url).group(1).startswith("https"): job_url = re.search((r"(.*shtml)"),url).group(1) else: job_url = "https://www.liepin.com"+ url yield Request(url = job_url , callback= self.job_url_parse) for page in range(1, self.max_page): yield Request(url=self.next_url_base.format(key="python", dqs="010", page=page), callback=self.out_html_parse) def job_url_parse(self, response): 解析页面的就先不写了 现在的问题是抓取时候只能抓取70-90个数据就结束了进程,所以现在想是不是callback遇到了问题,希望各位大神帮忙
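Without the log it is hard to say for sure why the crawl stops after 70-90 items; two things worth checking are whether later requests are being dropped (duplicate/offsite drops show up as DEBUG lines) and whether the site starts answering with non-200 pages, since the `if response.status == 200` guard silently discards everything else. Structurally, re-yielding the whole page range from inside `out_html_parse` on every response also floods the dupe filter; a cleaner sketch (URL templates shortened to placeholders) issues each list page once from `start_requests`:

```python
from scrapy import Spider, Request

class LiepinspiderSpider(Spider):
    name = 'liepinspider'
    max_page = 20
    # Placeholders; substitute the full templates from the question above.
    out_url = 'https://www.liepin.com/zhaopin/?key={key}&dqs={dqs}'
    next_url_base = 'https://www.liepin.com/zhaopin/?key={key}&dqs={dqs}&curPage={page}'

    def start_requests(self):
        # Request every list page exactly once up front; out_html_parse then only
        # extracts job links and never re-yields the page range.
        yield Request(self.out_url.format(key="python", dqs="010"),
                      callback=self.out_html_parse)
        for page in range(1, self.max_page):
            yield Request(self.next_url_base.format(key="python", dqs="010", page=page),
                          callback=self.out_html_parse)
```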
Garbled output when crawling the Zhihu homepage with Scrapy
爬取知乎首页,返回的response.text是乱码,尝试解码response.body,得到的还是乱码,不知道为什么,代码如下: ``` import scrapy HEADERS = { 'Host': 'www.zhihu.com', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.9', 'Cache-Control': 'no-cache', 'Connection': 'keep-alive', 'Origin': 'https://www.zhihu.com', 'Referer': 'https://www.zhihu.com/', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36' } class ZhihuSpider(scrapy.Spider): name = 'zhihu' allowed_domains = ['www.zhihu.com'] start_urls = ['https://www.zhihu.com/'] def start_requests(self): for url in self.start_urls: yield scrapy.Request(url, headers=HEADERS) def parse(self, response): print('========== parse ==========') print(response.text[:100]) body = response.body encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1'] for encoding in encodings: try: print('========== decode ' + encoding) print(body.decode(encoding)[:100]) print('========== decode end\n') except Exception as e: print('########## decode {0}, error: {1}\n'.format(encoding, e)) pass ``` 输出的log如下: D:\workspace_python\ZhihuSpider>scrapy crawl zhihu 2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider) 2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']} 2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats'] 2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened 2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: https://www.zhihu.com/) ========== parse ========== ��~!���#5���=B���_��^��ˆ� ═4�� 1���J�╗%Xi��/{�vH�"�� z�I�zLgü^�1� Q)Ա�_k}�䄍���/T����U�3���l��� ========== decode utf-8 ########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in 
position 0: invalid continuation byte ========== decode gbk ########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence ========== decode gb2312 ########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence ========== decode iso-8859-1 áø~!¢ With the same code, if I switch the target site to douban there is no problem at all. I've searched Baidu everywhere without finding a fix, so I'm asking here. Please help: if I can't get the crawler working, the Zhihu-style backend I'm building will have no data to show, and I'm really anxious. I have fewer than 5 C-coins left so I can't offer a bounty, but I truly need help.
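One likely culprit (an assumption, since the full raw bytes aren't shown): the request advertises `Accept-Encoding: gzip, deflate, br`, and Zhihu answers with Brotli-compressed content, which Scrapy's HttpCompressionMiddleware can only decompress if a brotli package is installed (and older versions not at all) — so what you see is still compressed bytes, not a charset problem. Douban presumably doesn't answer with Brotli, which would explain why the same code works there. A minimal sketch of the workaround:

```python
# Sketch: stop advertising Brotli so the server falls back to gzip/deflate,
# which Scrapy decompresses out of the box.
HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',   # 'br' removed
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'),
}
# Alternative: keep 'br' and install a brotli library (e.g. `pip install brotlipy`)
# so the compression middleware can handle it, if your Scrapy version supports that.
```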
Python crawling: proxy setting has no effect when scraping Douban Books with Scrapy + scrapy-splash; could someone take a look? Thanks
Scraping Douban Books with Scrapy and scrapy-splash, the proxy setting has no effect: after setting the proxy, the response still says login is required. The `'FirstSplash.middlewares.FirstsplashSpiderMiddleware': 823` entry in settings and the `request.meta['splash']['args']['proxy'] = "'http://112.87.69.226:9999"` line inside FirstsplashSpiderMiddleware were copied from the web; the proxy IP is a random one from the Xici free-proxy site. The text printed by `scrapy crawl Doubanbook` is still the login prompt. Could someone take a look? Thanks! Result: ![screenshot](https://img-ask.csdn.net/upload/201904/25/1556181491_319319.jpg)

Spider code:
```
    name = 'doubanBook'
    category = ''

    def start_requests(self):
        serachBook = ['python', 'scala', 'spark']
        for x in serachBook:
            self.category = x
            start_urls = ['https://book.douban.com/subject_search', ]
            url = start_urls[0] + "?search_text=" + x
            self.log("开始爬取:" + url)
            yield SplashRequest(url, self.parse_pre)

    def parse_pre(self, response):
        print(response.text)
```
Proxy middleware:
```
class FirstsplashSpiderMiddleware(object):
    def process_request(self, request, spider):
        print("进入代理")
        print(request.meta['splash']['args']['proxy'])
        request.meta['splash']['args']['proxy'] = "'http://112.87.69.226:9999"
        print(request.meta['splash']['args']['proxy'])
```
settings:
```
BOT_NAME = 'FirstSplash'
SPIDER_MODULES = ['FirstSplash.spiders']
NEWSPIDER_MODULE = 'FirstSplash.spiders'
ROBOTSTXT_OBEY = False
# docker + scrapy-splash configuration
FEED_EXPORT_ENCODING = 'utf-8'
# docker service address
SPLASH_URL = 'http://127.0.0.1:8050'
# dedup filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# use Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# switched to scrapy-splash's own spider middleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# downloader middlewares switched to scrapy-splash's own configuration
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'FirstSplash.middlewares.FirstsplashSpiderMiddleware': 823,
}
# browser-like request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
```
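Apart from the stray quote in `"'http://112.87.69.226:9999"` (and free Xici proxies being mostly dead or already blacklisted by Douban), a custom downloader middleware at priority 823 likely runs its `process_request` after scrapy-splash's SplashMiddleware (725) has already serialized the Splash arguments, so editing `meta['splash']['args']` at that point has no effect. The simpler route is to pass the proxy directly in the SplashRequest, since it is Splash that does the actual fetching. A hedged sketch (proxy address is a placeholder):

```python
import scrapy
from scrapy_splash import SplashRequest

class DoubanBookSpider(scrapy.Spider):
    name = 'doubanBook'

    def start_requests(self):
        for keyword in ['python', 'scala', 'spark']:
            url = 'https://book.douban.com/subject_search?search_text=' + keyword
            # 'proxy' is forwarded to the Splash HTTP API, which performs the request,
            # so this is where the proxy has to be declared.
            yield SplashRequest(url, self.parse_pre,
                                args={'proxy': 'http://112.87.69.226:9999', 'wait': 1})

    def parse_pre(self, response):
        self.logger.info(response.text[:200])
```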
Just started learning to crawl dynamic pages with Scrapy + Selenium, but it simply doesn't work and I don't know why. Code below, please advise!
MySpider looks like this:
```
class MySpider(scrapy.Spider):
    name = 'BAIScrapy'

    def start_requests(self):
        print('开始')
        url = 'https://www.bilibili.com/'
        request = scrapy.Request(url=url, callback=self.parse, dont_filter=True)
        request.meta['PhantomJS'] = True
        yield request

    def parse(self, response):
        print('Emmm...')
        item = BilibiliAnimeInfoScrapyItem()
        item['links'] = response.css('a::attr("href")').re("www.bilibili.com/bangumi/play/")
```
The middlewares file looks like this:
```
def process_reqeust(self, request, spider):
    print('进入selenium')
    driver = webdriver.PhantomJS()
    driver.get(request.url)
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'bili_bangumi')))
    driver.quit()
    yield HtmlResponse(url=request.url, encoding='utf-8', body=driver.page_source, request=request)
```
settings looks like this:
```
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'bilibili_anime_info_scrapy.middlewares.BilibiliAnimeInfoScrapyDownloaderMiddleware': 543,
}
```
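Three issues in the middleware are enough to explain "it just doesn't work" (there may be more): the method name is misspelled (`process_reqeust` instead of `process_request`, so Scrapy never calls it), `driver.page_source` is read after `driver.quit()`, and a downloader middleware must `return` a response rather than `yield` it. A corrected sketch:

```python
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class BilibiliAnimeInfoScrapyDownloaderMiddleware(object):
    def process_request(self, request, spider):   # correct spelling, so Scrapy invokes it
        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, 'bili_bangumi')))
            page = driver.page_source             # read the page before quitting
        finally:
            driver.quit()
        # Returning (not yielding) an HtmlResponse short-circuits the download
        # and hands the rendered page to the spider callback.
        return HtmlResponse(url=request.url, body=page, encoding='utf-8', request=request)
```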
A Baidu Muzhi Doctor crawler: I want to start by scraping all links for a given question, but nothing gets scraped. Could someone help me figure out why?
#写在前面的话 在这个爬虫里我想实现把百度拇指医生里关于“咳嗽”的链接全部爬取下来,下一步要进行的是把爬取到的每个链接里的items里面的内容爬取下来,但是我在第一步就卡住了,求各位大神帮我看一下吧。之前刚刚发了一篇问答,但是不知道怎么回事儿,现在找不到了,(貌似是被删了...?)救救小白吧!感激不尽! 这个是我的爬虫的结构 ![图片说明](https://img-ask.csdn.net/upload/201911/27/1574787999_274479.png) ##ks: ``` # -*- coding: utf-8 -*- import scrapy from kesou.items import KesouItem from scrapy.selector import Selector from scrapy.spiders import Spider from scrapy.http import Request ,FormRequest import pymongo class KsSpider(scrapy.Spider): name = 'ks' allowed_domains = ['kesou,baidu.com'] start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M'] def parse(self, response): item = KesouItem() contents = response.xpath('.//h3[@class="t"]') for content in contents: url = content.xpath('.//a/@href').extract()[0] item['url'] = url yield item if self.offset < 760: self.offset += 10 yield scrapy.Request(url = "https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=" + str(self.offset) + "&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M",callback=self.parse,dont_filter=True) ``` ##items: ``` # -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # https://docs.scrapy.org/en/latest/topics/items.html import scrapy class KesouItem(scrapy.Item): # 问题ID question_ID = scrapy.Field() # 问题描述 question = scrapy.Field() # 医生回答发表时间 answer_pubtime = scrapy.Field() # 问题详情 description = scrapy.Field() # 医生姓名 doctor_name = scrapy.Field() # 医生职位 doctor_title = scrapy.Field() # 医生所在医院 hospital = scrapy.Field() ``` ##middlewares: ``` # -*- coding: utf-8 -*- # Define here the models for your spider middleware # # See documentation in: # https://docs.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals class KesouSpiderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the spider middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_spider_input(self, response, spider): # Called for each response that goes through the spider # middleware and into the spider. # Should return None or raise an exception. return None def process_spider_output(self, response, result, spider): # Called with the results returned from the Spider, after # it has processed the response. # Must return an iterable of Request, dict or Item objects. for i in result: yield i def process_spider_exception(self, response, exception, spider): # Called when a spider or process_spider_input() method # (from other spider middleware) raises an exception. # Should return either None or an iterable of Request, dict # or Item objects. pass def process_start_requests(self, start_requests, spider): # Called with the start requests of the spider, and works # similarly to the process_spider_output() method, except # that it doesn’t have a response associated. # Must return only requests (not items). for r in start_requests: yield r def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) class KesouDownloaderMiddleware(object): # Not all methods need to be defined. 
If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_request(self, request, spider): # Called for each request that goes through the downloader # middleware. # Must either: # - return None: continue processing this request # - or return a Response object # - or return a Request object # - or raise IgnoreRequest: process_exception() methods of # installed downloader middleware will be called return None def process_response(self, request, response, spider): # Called with the response returned from the downloader. # Must either; # - return a Response object # - return a Request object # - or raise IgnoreRequest return response def process_exception(self, request, exception, spider): # Called when a download handler or a process_request() # (from other downloader middleware) raises an exception. # Must either: # - return None: continue processing this exception # - return a Response object: stops process_exception() chain # - return a Request object: stops process_exception() chain pass def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) ``` ##piplines: ``` # -*- coding: utf-8 -*- # Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html import pymongo from scrapy.utils.project import get_project_settings settings = get_project_settings() class KesouPipeline(object): def __init__(self): host = settings["MONGODB_HOST"] port = settings["MONGODB_PORT"] dbname = settings["MONGODB_DBNAME"] sheetname= settings["MONGODB_SHEETNAME"] # 创建MONGODB数据库链接 client = pymongo.MongoClient(host = host, port = port) # 指定数据库 mydb = client[dbname] # 存放数据的数据库表名 self.sheet = mydb[sheetname] def process_item(self, item, spider): data = dict(item) self.sheet.insert(data) return item ``` ##settings: ``` # -*- coding: utf-8 -*- # Scrapy settings for kesou project # # For simplicity, this file contains only settings considered important or # commonly used. 
You can find more settings consulting the documentation: # # https://docs.scrapy.org/en/latest/topics/settings.html # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html # https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'kesou' SPIDER_MODULES = ['kesou.spiders'] NEWSPIDER_MODULE = 'kesou.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'kesou (+http://www.yourdomain.com)' # Obey robots.txt rules ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16) #CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0) # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs #DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of: #CONCURRENT_REQUESTS_PER_DOMAIN = 16 #CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default) COOKIES_ENABLED = False # Disable Telnet Console (enabled by default) #TELNETCONSOLE_ENABLED = False USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0" # Override the default request headers: #DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', #} # Enable or disable spider middlewares # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html #SPIDER_MIDDLEWARES = { # 'kesou.middlewares.KesouSpiderMiddleware': 543, #} # Enable or disable downloader middlewares # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html #DOWNLOADER_MIDDLEWARES = { # 'kesou.middlewares.KesouDownloaderMiddleware': 543, #} # Enable or disable extensions # See https://docs.scrapy.org/en/latest/topics/extensions.html #EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, #} # Configure item pipelines # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html ITEM_PIPELINES = { 'kesou.pipelines.KesouPipeline': 300, } # MONGODB 主机名 MONGODB_HOST = "127.0.0.1" # MONGODB 端口号 MONGODB_PORT = 27017 # 数据库名称 MONGODB_DBNAME = "ks" # 存放数据的表名称 MONGODB_SHEETNAME = "ks_urls" # Enable and configure the AutoThrottle extension (disabled by default) # See https://docs.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHROTTLE_ENABLED = True # The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings #HTTPCACHE_ENABLED = True #HTTPCACHE_EXPIRATION_SECS = 0 #HTTPCACHE_DIR = 'httpcache' #HTTPCACHE_IGNORE_HTTP_CODES = [] #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' ``` ##run.py: ``` # -*- coding: utf-8 -*- from scrapy import cmdline cmdline.execute("scrapy crawl ks".split()) ``` ##这个是运行出来的结果: ``` PS D:\scrapy_project\kesou> scrapy crawl ks 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: kesou) 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twis.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC 
v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryphy 2.6.1, Platform Windows-10-10.0.18362-SP0 2019-11-27 00:14:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'kesou', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['kesou.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67 2019-11-27 00:14:17 [scrapy.extensions.telnet] INFO: Telnet Password: 051629c46f34abdf 2019-11-27 00:14:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled item pipelines: ['kesou.pipelines.KesouPipeline'] 2019-11-27 00:14:19 [scrapy.core.engine] INFO: Spider opened 2019-11-27 00:14:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-11-27 00:14:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-11-27 00:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX% (referer: None) 2019-11-27 00:14:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFFSYX%2B1M> (referer: None) Traceback (most recent call last): File "d:\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output for x in result: File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr> return (_set_referer(r) for r in result or ()) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> return (r for r in result 
or () if _filter(r)) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> return (r for r in result or () if _filter(r)) File "D:\scrapy_project\kesou\kesou\spiders\ks.py", line 19, in parse item['url'] = url File "d:\anaconda3\lib\site-packages\scrapy\item.py", line 73, in __setitem__ (self.__class__.__name__, key)) KeyError: 'KesouItem does not support field: url' 2019-11-27 00:14:20 [scrapy.core.engine] INFO: Closing spider (finished) 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 10, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/KeyError': 1, 'start_time': datetime.datetime(2019, 11, 26, 16, 14, 19, 863597)} 2019-11-27 00:14:21 [scrapy.core.engine] INFO: Spider closed (finished) ```
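The traceback at the end gives the answer away: `KeyError: 'KesouItem does not support field: url'` — the item class only defines question_ID, question, and so on, so assigning `item['url']` fails. The parse method also references `self.offset`, which is never initialized, and `allowed_domains` contains a typo ('kesou,baidu.com'). A hedged sketch of the corrected pieces (the search URL is trimmed for readability):

```python
import scrapy

class KesouItem(scrapy.Item):
    url = scrapy.Field()           # add the field actually used in parse()
    question_ID = scrapy.Field()
    # ... keep the other fields from the items.py shown above

class KsSpider(scrapy.Spider):
    name = 'ks'
    allowed_domains = ['baidu.com']   # was 'kesou,baidu.com'
    offset = 0                        # parse() increments this, so define it
    start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0']  # trimmed for the sketch

    def parse(self, response):
        for content in response.xpath('.//h3[@class="t"]'):
            item = KesouItem()
            item['url'] = content.xpath('.//a/@href').extract_first()
            yield item
        if self.offset < 760:
            self.offset += 10
            yield scrapy.Request(
                url='https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=' + str(self.offset),
                callback=self.parse, dont_filter=True)
```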
Could someone explain why the crawler gets back {"code":"600","echo":"signature verification failed"}?
The crawler gets back {"code":"600","echo":"signature verification failed"}. When scraping a mobile app's API with both Scrapy and requests, the response is always {"code":"600","echo":"signature verification failed"}. Please help me solve this.
Scrapy cannot connect after enabling a proxy middleware
I've tried many IPs, all elite (high-anonymity) proxies, and each works fine when tested with a single-page requests.get. But in Scrapy, once the proxy middleware is enabled, every connection is refused; setting the proxy directly in the spider doesn't work either. Looking for a solution! ![screenshot](https://img-ask.csdn.net/upload/201908/02/1564733970_942972.jpg)
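A frequent cause (an assumption, since the middleware code isn't shown) is a malformed proxy value: requests accepts a bare host:port, while Scrapy needs `meta['proxy']` to be a full URL with a scheme, and credentials if the proxy requires them. A minimal sketch of a working downloader middleware, with a placeholder proxy:

```python
# middlewares.py (sketch)
class ProxyMiddleware(object):
    PROXY = 'http://117.88.176.38:3000'   # placeholder; keep the scheme, add user:pass@ if needed

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever URL is set in meta['proxy'].
        request.meta['proxy'] = self.PROXY

# settings.py (sketch)
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.ProxyMiddleware': 350,  # before the built-in HttpProxyMiddleware (750)
# }
```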
[Help a newbie] In Scrapy, how do I use a parameter I already have inside parse() to build a POST request and get the returned data?
I'm a newbie who has only used the Scrapy framework for a week; before this I only ever hand-rolled crawlers without a framework. Here's my problem. First I request a page:
```
def start_requests(self):
    urls = ["http://www.tiku.cn/index/index/questions?cid=14&cno=1&unitid=800417&chapterid=701354&typeid=600122&thrknowid=700137"]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
```
This is handed to parse(), where I extract the key parameter question_ID. Inside parse() I'd like to use that question_id directly to build a POST request, get the JSON it returns, and save it into item['correct_answer']. How do I do that? Thanks a lot for taking the time to answer!
```
def parse(self, response):
    item = TikuItem()
    for i in range(1, 11):
        QUESTION_ID = str(response.xpath('(/html/body/div[4]/div[2]/div[2]/div[' + str(i) + ']/div[@class="q-analysis text-l"]/@id)').extract_first()[3:])
        item['question_ID'] = QUESTION_ID
```
This is my items.py file:
```
class TikuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    question_ID = scrapy.Field()     # question number
    correct_answer = scrapy.Field()  # correct answer
```
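The usual pattern is to yield a FormRequest (or a Request with method='POST') from inside parse() and carry the partially-filled item along in `meta`. The answer endpoint, form field, and JSON key below are assumptions, not the real tiku.cn API:

```python
import json
import scrapy
from ..items import TikuItem   # the item class shown above

class TikuSpider(scrapy.Spider):
    name = 'tiku'

    def parse(self, response):
        for i in range(1, 11):
            item = TikuItem()   # one item per question, not one shared item
            question_id = response.xpath(
                '(/html/body/div[4]/div[2]/div[2]/div[%d]'
                '/div[@class="q-analysis text-l"]/@id)' % i).extract_first()[3:]
            item['question_ID'] = question_id
            # Hypothetical endpoint and field names; replace with what the site really uses.
            yield scrapy.FormRequest(
                url='http://www.tiku.cn/index/index/answer',
                formdata={'question_id': question_id},
                meta={'item': item},
                callback=self.parse_answer)

    def parse_answer(self, response):
        item = response.meta['item']
        data = json.loads(response.text)             # assumes the endpoint returns JSON
        item['correct_answer'] = data.get('answer')  # hypothetical key
        yield item
```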
Passing a value from PyQt5 to Scrapy
用pyqt5给爬虫做个界面,但是在界面中的lineEdit文本传不到爬虫中去(要爬微博所以得传一个用于搜索的关键字) 方法是设一个全局变量KEYWORD然后再在界面中用lineEdit修改这个全局变量,最后开启爬虫,读取这个修改后的KEYWORD 无关的函数我都改成pass方便查看- -,为什么这方法有错误,是因为开了另一个线程然后爬虫默认赋值为原来的关键字1 ? # -*- coding: utf-8 -*- KEYWORD = '关键字1' class Ui_Form(object): def setupUi(self, Form): Form.setObjectName("Form") Form.resize(769, 575) self.lineEdit = QLineEdit(Form) self.lineEdit.setGeometry(QRect(130, 50, 161, 21)) self.lineEdit.setObjectName("lineEdit") self.label = QLabel(Form) self.label.setGeometry(QRect(30, 50, 91, 21)) self.label.setObjectName("label") self.pushButton_2 = QPushButton(Form) self.pushButton_2.setGeometry(QRect(550, 40, 81, 41)) self.pushButton_2.setObjectName("pushButton_2") self.pushButton_3 = QPushButton(Form) self.pushButton_3.setGeometry(QRect(330, 40, 81, 41)) self.pushButton_3.setObjectName("pushButton_3") self.pushButton_4 = QPushButton(Form) self.pushButton_4.setGeometry(QRect(440, 40, 81, 41)) self.pushButton_4.setObjectName("pushButton_4") self.pushButton_5 = QPushButton(Form) self.pushButton_5.setGeometry(QRect(660, 40, 81, 41)) self.pushButton_5.setObjectName("pushButton_5") self.pushButton_4.clicked.connect(self.pop2) #开启爬虫 self.pushButton_2.clicked.connect(self.pop1) self.pushButton_3.clicked.connect(self.pop4) #开启cookiespool和修改关键字值 self.pushButton_5.clicked.connect(self.pop5) self.tableView = QTableView(Form) self.tableView.setGeometry(QRect(15, 131, 731, 421)) #设置tableView self.model = QStandardItemModel(1, 6) self.model.setHorizontalHeaderLabels(['作者id', '评论数', '正文', '转发数', '点赞数', 'user']) self.tableView.setEditTriggers(QAbstractItemView.NoEditTriggers) # 只读 self.tableView.resizeColumnsToContents() # 宽度和长度和显示内容相同 self.tableView.setModel(self.model) #设置tableView结束 self.tableView.setObjectName("tableView") self.label_2 = QLabel(Form) self.label_2.setGeometry(QRect(30, 110, 72, 15)) self.label_2.setObjectName("label_2") self.retranslateUi(Form) QMetaObject.connectSlotsByName(Form) def retranslateUi(self, Form): _translate = QCoreApplication.translate Form.setWindowTitle(_translate("Form", "Form")) self.label.setText(_translate("Form", "输入关键字")) self.pushButton_2.setText(_translate("Form", "显示结果")) self.pushButton_3.setText(_translate("Form", "启动服务")) self.pushButton_4.setText(_translate("Form", "开始抓取")) self.pushButton_5.setText(_translate("Form", "结果分析")) self.label_2.setText(_translate("Form", "结果显示")) #槽函数部分 def pop1(self): #从数据库显示数据 pass def pop2(self): #开启爬虫 new.run() def pop3(self): #退出 pass def pop4(self): #开启服务 在这修改关键字 比如传入的时关键字2 global KEYWORD KEYWORD = self.lineEdit.text() print(KEYWORD) #输出会显示关键字2 而不是关键字1 s.start() def pop5(self): #结果显示 pass if __name__ == '__main__': app = QApplication(sys.argv) MainWindow = QMainWindow() ui = Ui_Form() ui.setupUi(MainWindow) MainWindow.show() sys.exit(app.exec_()) #爬虫部分 class WeiboSpider(Spider): client = pymongo.MongoClient(host='127.0.0.1', port=27017) db = client.weibo p = db.weibo name = 'weibo' allowed_domains = ["weibo.cn"] start_url='https://weibo.cn/search/mblog' max_page = 100 count = 0 def start_requests(self): global KEYWORD keyword = KEYWORD #这里获取不到已经修改的关键字 print(keyword) #输出的还是关键字1 url='{url}?keyword={keyword}'.format(url=self.start_url, keyword=keyword) for page in range(self.max_page + 1): data = { 'mp' : str(self.max_page), 'page' : str(page) } yield FormRequest(url, callback=self.parse_index, formdata=data) def parse_index(self, response): pass def comment_detail(self, response): pass new.py 文件内容 from scrapy.crawler import CrawlerProcess from weibosearch.spiders.weibo import WeiboSpider def run(): process 
= CrawlerProcess() process.crawl(WeiboSpider) process.start()
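The reason the spider still sees '关键字1': the UI module and the spider module each have their own module-level KEYWORD, so reassigning the global in the PyQt code never touches the copy defined inside weibosearch.spiders.weibo. The clean fix is to pass the keyword to the crawler as a spider argument instead of a global. A hedged sketch:

```python
# new.py (sketch): forward the keyword as a spider argument
from scrapy.crawler import CrawlerProcess
from weibosearch.spiders.weibo import WeiboSpider

def run(keyword):
    process = CrawlerProcess()
    process.crawl(WeiboSpider, keyword=keyword)   # becomes self.keyword on the spider
    process.start()

# In the spider, read the argument instead of the module-level global:
#     def start_requests(self):
#         keyword = getattr(self, 'keyword', '关键字1')
#         url = '{url}?keyword={keyword}'.format(url=self.start_url, keyword=keyword)
#         ...
# And in the UI slot:  new.run(self.lineEdit.text())
```

Note also that a Twisted reactor can only be started once per process, so launching the crawl repeatedly from the GUI button needs a subprocess (or CrawlerRunner) rather than calling process.start() again.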
A Douban crawler downloaded from GitHub finishes with "Process finished with exit code 0" and no output; please help
本人不是很懂爬虫,用的别人上传豆瓣爬虫的代码运行不出来结果也没有报错,百度了一下说是因为没有写输出,但我真的脑汁都要榨干了也编不出来代码,跪求帮助,如果大神你看到又恰好有时间,拜托帮帮我这个渣渣吧。 代码如下 ``` import scrapy import sys import re import random # from douban.items import UserItem from douban.items import DoubanItem class CollectSpider(scrapy.Spider): name = 'collect_spider' user_ids = [121253425] crawl_types = ['collect_movie'] douban_types = ['collect_movie', 'wise_movie'] # 看过的电影、想看的电影、在看的电影 url templates collect_movie_tpl = 'https://movie.douban.com/people/vividtime/collect?start={}' wish_movie_tpl = 'https://movie.douban.com/people/vividtime/wish?start={}' do_movie_tpl = 'https://movie.douban.com/people/vividtime/do?start={}' douban_url_templates = { 'collect_movie': collect_movie_tpl, 'wish_movie': wish_movie_tpl, 'do_movie': do_movie_tpl, } cookies_list = [] def get_random_cookies(self): """ get a random cookies in cookies_list """ cookies = random.choice(self.cookies_list) rt = {} for item in cookies.split(';'): key, value = item.split('=')[0].strip(), item.split('=')[1].strip() rt[key] = value return rt def start_requests(self): """ entry of this spider """ for user_id in self.user_ids: for douban_type in self.douban_types: if douban_type in self.crawl_types: url_template = self.douban_url_templates[douban_type] if url_template: request_url = url_template.format(user_id, 0) + '&' else: raise Exception('Wrong douban_type: %s' % douban_type) meta = { 'url_template': url_template, 'douban_type': douban_type, 'user_id': user_id } yield scrapy.Request(url=request_url, meta=meta, callback=self.parse_first_page) def parse_first_page(self, response): """ parse first page to get total pages and generate more requests """ meta = response.meta total_page = response.xpath('//*[@id="content"]//div[@class="paginator"]/a[last()]/text()').extract() total_page = int(total_page[0]) if total_page else 1 for page in range(total_page): request_url = meta['url_template'].format(meta['user_id'], 15 * page) yield scrapy.Request(url=request_url, meta=meta, callback=self.parse_content) def parse_content(self, response): """ parse collect item of response """ meta = response.meta item_list = response.xpath('//*[@id="content"]//div[@class="grid-view"]/div') for item in item_list: douban_item = DoubanItem() douban_item['item_type'] = meta['douban_type'] douban_item['user_id'] = meta['user_id'] item_id = item.xpath('div[@class="info"]/ul/li[@class="title"]/a/@href').extract() douban_item['item_id'] = int(item_id[0].split('/')[-2]) if item_id else None item_name = item.xpath('div[@class="info"]/ul/li[@class="title"]/a/em/text()').extract() douban_item['item_name'] = item_name[0].strip() if item_name else None item_other_name = item.xpath('div[@class="info"]/ul/li[@class="title"]/a').xpath('string(.)').extract() douban_item['item_other_name'] = item_other_name[0].replace(douban_item['item_name'], '').strip() if item_other_name else None item_intro = item.xpath('div[@class="info"]/ul/li[@class="intro"]/text()').extract() douban_item['item_intro'] = item_intro[0].strip() if item_intro else None item_rating = item.xpath('div[@class="info"]/ul/li[3]/*[starts-with(@class, "rating")]/@class').extract() douban_item['item_rating'] = int(item_rating[0][6]) if item_rating else None item_date = item.xpath('div[@class="info"]/ul/li[3]/*[@class="date"]/text()').extract() douban_item['item_date'] = item_date[0] if item_date else None item_tags = item.xpath('div[@class="info"]/ul/li[3]/*[@class="tags"]/text()').extract() douban_item['item_tags'] = item_tags[0].replace(u'标签: ', '') if item_tags else None 
item_comment = item.xpath('div[@class="info"]/ul/li[4]/*[@class="comment"]/text()').extract() douban_item['item_comment'] = item_comment[0] if item_comment else None item_poster_id = item.xpath('div[@class="pic"]/a[1]/img[1]/@src').extract() douban_item['item_poster_id'] = int(item_poster_id[0].split('/')[-1].split('.')[0][1:]) if item_poster_id else None yield douban_item ```
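"Process finished with exit code 0" with no output and no error usually means the spider class was never actually run — for example, the file was executed directly in PyCharm instead of going through `scrapy crawl collect_spider` (and this particular spider also expects `cookies_list` to be filled in, since `get_random_cookies` chooses from it). A hedged sketch of a small runner script that PyCharm can execute, which will at least surface Scrapy's log output:

```python
# run.py (sketch): lets PyCharm run the spider and show Scrapy's log
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from douban.spiders.collect_spider import CollectSpider  # adjust to the real module path

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())
    process.crawl(CollectSpider)
    process.start()   # the log will now show whether any pages were crawled at all
```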
Why does it complain about the key here, even though print works fine? (bs4, attrs)
Complete beginner, just learning Python; my teacher asked me to write a crawler and this problem has had me stuck for a while. Written in PyCharm. Code:
```
def get_html(url):
    target = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')
    target_undone = soup.find_all('a')
    for a in target_undone:
        print(a['href'])
        target.append(a['href'])
    return target
```
The error:
```
/travels/nanjing9/3751226.html
/travels/nanjing9/3752303.html
/travels/nanjing9/1368968.html
/travels/nanjing9/2007739.html
/travels/nanjing9/1426946.html
/travels/nanjing9/1580737.html
/travels/nanjing9/2694051.html
/travels/nanjing9/1660560.html
/travels/nanjing9/2716240.html
/travels/nanjing9/1727375.html
Traceback (most recent call last):
  File "C:/Users/MoslemMaster/PycharmProjects/Scrapy/Testing.py", line 63, in <module>
    print(get_html('http://you.ctrip.com/travels/nanjing9/t3-d1.html'))
  File "C:/Users/MoslemMaster/PycharmProjects/Scrapy/Testing.py", line 16, in get_html
    print(a['href'])
  File "C:\Users\MoslemMaster\PycharmProjects\Scrapy\venv\lib\site-packages\bs4\element.py", line 1071, in __getitem__
    return self.attrs[key]
KeyError: 'href'
```
Is it something about calling another function inside append? It seems to happen whenever I do an assignment — why? It's all using the bs4 library, so why does it complain about the key? Did the attrs dict get blown up by some mysterious force? I tried `b = a['href']` and got the same KeyError, so it seems related to assignment somehow. Why?

Update: I'm an idiot — some of the a tags simply have no href.
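As the update says, some anchors carry no href attribute, so `a['href']` raises KeyError the first time such a tag appears (the earlier prints worked because those tags did have one). Using `.get()` or filtering with `href=True` avoids it:

```python
import requests
from bs4 import BeautifulSoup

def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')
    # find_all('a', href=True) only returns anchors that actually have an href,
    # so indexing can no longer raise KeyError.
    return [a['href'] for a in soup.find_all('a', href=True)]

# Equivalent alternative inside a loop: a.get('href') returns None instead of raising.
```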