How can I keep header names lower-case when sending a scrapy Request in Python 3?

When a Request is sent in Python 3, Scrapy automatically normalizes the header names: the first letter is capitalized, as is the first letter after an underscore or other separator. That is a problem for me, because I need to send the server a header whose name is entirely lower-case; after Scrapy processes the request the name comes out title-cased, and the server simply does not recognize it. Does anyone know whether scrapy.Request has a way to leave the headers untouched? When I make the same request with the requests library instead of scrapy.Request, the headers go out unchanged.

[Screenshot: the headers as set before the request]

[Screenshot: the request headers captured by a packet sniffer, showing the names title-cased]

1 answer

In the spider module, above the spider class, register the header names that must not be title-cased:

```python
from twisted.web.http_headers import Headers as TwistedHeaders

# Twisted stores header names lower-cased and re-capitalizes them on the
# wire by looking them up in this (private) mapping; an identity entry
# forces the name to be emitted exactly as written.
TwistedHeaders._caseMappings.update({
    b'uuid': b'uuid',
    b'content-type': b'content-type',
})
```
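Why this works, for the record: Scrapy title-cases header names when it stores them on the request, but the bytes actually sent are written by Twisted, which keeps names lower-cased internally and re-capitalizes them at send time via `_caseMappings`, so an identity entry wins even though `request.headers` still displays the title-cased form. Since `_caseMappings` is a private Twisted attribute, treat this as a version-dependent workaround rather than a supported switch. A minimal usage sketch, with a placeholder URL:

```python
import scrapy
from twisted.web.http_headers import Headers as TwistedHeaders

# Identity entry: emit 'uuid' exactly as written instead of 'Uuid'.
# _caseMappings is private Twisted API, so verify it still exists when
# upgrading Twisted.
TwistedHeaders._caseMappings.update({b'uuid': b'uuid'})

class ApiSpider(scrapy.Spider):
    name = 'api'

    def start_requests(self):
        # request.headers will still display 'Uuid' (Scrapy title-cases
        # names for storage), but the bytes on the wire stay lower-case.
        yield scrapy.Request(
            'https://example.com/api',  # placeholder URL
            headers={'uuid': '123e4567-e89b-12d3-a456-426614174000'},
        )

    def parse(self, response):
        self.logger.info('status %s', response.status)
```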
qq_29581601 replying to 窃语: Nice one, personally tested and it works.
(replied 8 months ago)
Other related questions
With DEFAULT_REQUEST_HEADERS already configured in settings.py, how should headers be written when issuing a request in Scrapy?
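A minimal sketch of how the two interact (URLs and header values here are placeholders): DefaultHeadersMiddleware attaches DEFAULT_REQUEST_HEADERS to every request, and a `headers=` dict passed to `Request` overrides matching keys for that single request only.

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/']  # placeholder

    def parse(self, response):
        # DEFAULT_REQUEST_HEADERS is applied to every request by
        # DefaultHeadersMiddleware; keys passed here override the
        # defaults for this one request only.
        yield scrapy.Request(
            'https://example.com/api',
            headers={'Referer': response.url},
            callback=self.parse_api,
        )

    def parse_api(self, response):
        self.logger.info('got %s', response.url)
```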
How do I stop a Scrapy request from automatically carrying the Set-Cookie value returned by the previous request?
As shown in the picture, before the request enters the downloader there is no cookie field in its headers, but once the download finishes the request's headers contain a cookie field whose value is the Set-Cookie content returned by the previous response. I actually don't want this request to carry any cookies at all. I tried setting COOKIES_ENABLED = False in settings; request.headers then indeed contains no cookie and I get the request headers I want, but subsequent requests that do need cookies can no longer proceed normally. How can I make just this one request not carry the previous response's information? ![图片说明](https://img-ask.csdn.net/upload/201904/24/1556070657_534485.png)
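Scrapy does have a per-request switch for exactly this: the `dont_merge_cookies` meta key tells CookiesMiddleware to neither attach stored cookies to the request nor record that response's Set-Cookie, while other requests keep their session. A sketch with placeholder URLs:

```python
import scrapy

class CookieControlSpider(scrapy.Spider):
    name = 'cookie_control'

    def start_requests(self):
        # dont_merge_cookies: CookiesMiddleware neither attaches stored
        # cookies to this request nor stores its Set-Cookie afterwards.
        yield scrapy.Request(
            'https://example.com/clean',  # placeholder URL
            meta={'dont_merge_cookies': True},
            callback=self.parse,
        )
        # Requests that do need the session keep working as usual
        # (optionally in a named jar via the 'cookiejar' meta key).
        yield scrapy.Request(
            'https://example.com/session',  # placeholder URL
            meta={'cookiejar': 'main'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('fetched %s', response.url)
```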
Scrapy reports "Missing scheme in request url" when running the spider
I'm a complete Scrapy beginner playing with a tutorial project (http://blog.csdn.net/czl389/article/details/77278166, scraping hip-hop lyrics from Xiami). The case has three spiders: songurls, lyrics and songinfo. The songurls spider crawls the song URLs from Xiami Music and saves them into SongUrls.csv, but running the lyrics spider fails with the log below.

```
D:\xiami2\xiami2>scrapy crawl lyrics -o Lyrics.csv
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiami2)
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiami2.spiders', 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 4.0; Trident/3.0)', 'FEED_URI': 'Lyrics.csv', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.2, 'SPIDER_MODULES': ['xiami2.spiders'], 'BOT_NAME': 'xiami2'}
2017-10-21 21:13:29 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled item pipelines: ['xiami2.pipelines.Xiami2Pipeline']
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider opened
2017-10-21 21:13:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-21 21:13:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-21 21:13:31 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "d:\python3.5\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "d:\python3.5\lib\site-packages\scrapy\spiders\__init__.py", line 83, in start_requests
    yield Request(url, dont_filter=True)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-21 21:13:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 567323),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 536236)}
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider closed (finished)
```

I looked into Request's `__init__.py` and found this check:

```python
if ':' not in self._url:
    raise ValueError('Missing scheme in request url: %s' % self._url)
```

I've read a number of solutions online, but none of them solved my problem, so I'm asking here; please point me in the right direction (I'm truly out of C-coins). I haven't asked many questions before, so forgive any formatting flaws. The code is attached.

```python
# songurls.py
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from ..items import SongUrlItem

class SongurlsSpider(scrapy.Spider):
    name = 'songurls'
    allowed_domains = ['xiami.com']

    # put the links page/1 through page/401 into start_urls
    start_url_list = []
    url_fixed = 'http://www.xiami.com/song/tag/Hip-Hop/page/'
    # widen the range to 1-401 to get every page
    for i in range(1, 402):
        start_url_list.extend([url_fixed + str(i)])
    start_urls = start_url_list

    def parse(self, response):
        urls = response.xpath('//*[@id="wrapper"]/div[2]/div/div/div[2]/table/tbody/tr/td[2]/a[1]/@href').extract()
        for url in urls:
            song_url = response.urljoin(url)
            url_item = SongUrlItem()
            url_item['song_url'] = song_url
            yield url_item
```

```python
# lyrics.py
import scrapy
import re

class LyricsSpider(scrapy.Spider):
    name = 'lyrics'
    allowed_domains = ['xiami.com']
    song_url_file = 'SongUrls.csv'

    def __init__(self, *args, **kwargs):
        # read every song url from the csv file
        f = open(self.song_url_file, "r")
        lines = f.readlines()
        # line[:-1] strips the newline at the end of each line
        # lines[1:] skips the csv header row
        song_url_list = [line[:-1] for line in lines[1:]]
        f.close()
        while '\n' in song_url_list:
            song_url_list.remove('\n')
        self.start_urls = song_url_list  # [:100]  # drop [:100] to crawl everything

    def parse(self, response):
        lyric_lines = response.xpath('//*[@id="lrc"]/div[1]/text()').extract()
        lyric = ''
        for lyric_line in lyric_lines:
            lyric += lyric_line
        # print lyric
        lyricItem = LyricItem()
        lyricItem['lyric'] = lyric
        lyricItem['song_url'] = response.url
        yield lyricItem
```

songinfo isn't used yet, so it doesn't matter here.

```python
# items.py
import scrapy

class SongUrlItem(scrapy.Item):
    song_url = scrapy.Field()  # song link

class LyricItem(scrapy.Item):
    lyric = scrapy.Field()     # lyrics
    song_url = scrapy.Field()  # song link

class SongInfoItem(scrapy.Item):
    song_url = scrapy.Field()    # song link
    song_title = scrapy.Field()  # title
    album = scrapy.Field()       # album
    # singer = scrapy.Field()    # singer
    language = scrapy.Field()    # language
```

A few lines added in the middleware:

```python
sleep_seconds = 0.2  # sleep after the simulated click so the browser can fetch the response
default_sleep_seconds = 1  # sleep time for requests with no action

def process_request(self, request, spider):
    spider.logger.info('--------Spider request processed: %s' % spider.name)
    page = None
    driver = webdriver.PhantomJS()
    spider.logger.info('--------request.url: %s' % request.url)
    driver.get(request.url)
    driver.implicitly_wait(0.2)
    # sleep a few seconds so the page finishes loading before returning its content
    time.sleep(self.sleep_seconds)
    page = driver.page_source
    driver.close()
    return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
```

A few lines added to or changed in settings:

```python
from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

DOWNLOAD_DELAY = 0.2

DEFAULT_REQUEST_HEADERS = {
    'Host': 'www.xiami.com',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
}

ITEM_PIPELINES = {
    'xiami2.pipelines.Xiami2Pipeline': 300,
}
```
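The empty URL after the colon in `ValueError: Missing scheme in request url:` suggests a blank line (or the header row) from the CSV is being handed to `Request`. A defensive sketch of the CSV-loading step, keeping the file name from the code above and swapping in the `csv` module for the manual readlines:

```python
import csv

def load_song_urls(path):
    """Read SongUrls.csv and return only well-formed absolute URLs."""
    urls = []
    with open(path, "r") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            if not row:
                continue  # skip blank lines entirely
            url = row[0].strip()
            # Request() requires a scheme; drop anything that lacks one.
            if url.startswith(("http://", "https://")):
                urls.append(url)
    return urls
```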
Why can I crawl the page directly with requests, but not with Scrapy?
```python
class job51():
    def __init__(self):
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
            'Cookie': ''
        }

    def start(self):
        html = session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php", headers=self.headers)
        self.parse(html)

    def parse(self, response):
        tree = lxml.etree.HTML(response.text)
        resume_url = tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
        print(resume_url[0])
```

This gets the result I want, namely the resume's URL. But with Scrapy and the very same headers, the page seems to be stuck at the login page:

```python
class job51(Spider):
    name = "job51"
    # allowed_domains = ["my.51job.com"]
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
        'Cookie': ''
    }

    def start_requests(self):
        yield Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)

    def parse(self, response):
        # tree = lxml.etree.HTML(text)
        selector = Selector(response)
        print("<<<<<<<<<<<<<<<<<<<<<", response.text)
        resume_url = selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
        print(">>>>>>>>>>>>", resume_url)
```

The output:

```
[scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
    yield Request(resume_url[0],headers=self.headers,callback=self.getResume)
  File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5743,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
```
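The 200 response whose body is just a `window.location` redirect to login.51job.com means no valid session cookie reached the server. A likely explanation: in the requests version, `session` presumably carries cookies from an earlier login, while the Scrapy spider starts with none, and a raw `Cookie` string in `headers=` is managed (and can be overridden) by Scrapy's CookiesMiddleware anyway. A sketch of handing the cookies to the middleware instead (the cookie name below is hypothetical; copy the real ones from a logged-in browser):

```python
import scrapy

class Job51Spider(scrapy.Spider):
    name = "job51"

    def start_requests(self):
        # Pass cookies as a dict so CookiesMiddleware sends and tracks them.
        cookies = {"SESSION_ID": "paste-the-logged-in-value-here"}  # hypothetical name
        yield scrapy.Request(
            "http://my.51job.com/cv/CResume/CV_CResumeManage.php",
            cookies=cookies,
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("body starts with: %s", response.text[:200])
```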
Python crawler: setting a proxy has no effect when crawling Douban Books with Scrapy and scrapy-splash. Could someone take a look? Thanks
Crawling Douban Books with Scrapy and scrapy-splash, setting a proxy has no effect: with the proxy configured, the site still tells me to log in. The `'FirstSplash.middlewares.FirstsplashSpiderMiddleware': 823` entry in settings and the `request.meta['splash']['args']['proxy'] = "'http://112.87.69.226:9999"` line inside FirstsplashSpiderMiddleware were copied from the web; the proxy IP is a random one from the 西刺免费代理IP site. The text printed by `scrapy crawl Doubanbook` still says login is required. Any help is appreciated! Run result: ![图片说明](https://img-ask.csdn.net/upload/201904/25/1556181491_319319.jpg)

Spider code:
```python
name = 'doubanBook'
category = ''

def start_requests(self):
    serachBook = ['python', 'scala', 'spark']
    for x in serachBook:
        self.category = x
        start_urls = ['https://book.douban.com/subject_search', ]
        url = start_urls[0] + "?search_text=" + x
        self.log("start crawling: " + url)
        yield SplashRequest(url, self.parse_pre)

def parse_pre(self, response):
    print(response.text)
```

Middleware proxy configuration:
```python
class FirstsplashSpiderMiddleware(object):
    def process_request(self, request, spider):
        print("entering the proxy middleware")
        print(request.meta['splash']['args']['proxy'])
        request.meta['splash']['args']['proxy'] = "'http://112.87.69.226:9999"
        print(request.meta['splash']['args']['proxy'])
```

settings configuration:
```python
BOT_NAME = 'FirstSplash'
SPIDER_MODULES = ['FirstSplash.spiders']
NEWSPIDER_MODULE = 'FirstSplash.spiders'
ROBOTSTXT_OBEY = False

# docker + scrapy-splash configuration
FEED_EXPORT_ENCODING = 'utf-8'
# splash service address
SPLASH_URL = 'http://127.0.0.1:8050'
# dedup filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# replaced with splash's own spider-middleware configuration
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# downloader middlewares replaced with splash's own configuration
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'FirstSplash.middlewares.FirstsplashSpiderMiddleware': 823,
}
# browser-like request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
```
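Two things stand out here, for what it's worth. First, the proxy string contains a stray quote (`"'http://...`), so Splash receives an invalid proxy URL. Second, if I read scrapy_splash's ordering right, with priority 823 the middleware's `process_request` runs after SplashMiddleware (725) has already rewritten the request into a call to the Splash HTTP API, so editing `meta['splash']['args']` at that point comes too late. Setting the proxy when the request is created sidesteps both; a sketch using the question's own proxy address:

```python
from scrapy_splash import SplashRequest

# Inside start_requests(): pass the proxy as a Splash arg at creation
# time, before SplashMiddleware serializes the args, and without the
# stray quote the middleware version prepends.
yield SplashRequest(
    url,
    self.parse_pre,
    args={'proxy': 'http://112.87.69.226:9999'},
)
```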
Scrapy: when a proxy is enabled for abnormal status codes, why does it only take effect for the first request?
Part of the downloader middleware:

```python
def process_response(self, request, response, spider):
    status_code = [403]
    if response.status in status_code:
        spider.logger.debug('Error ======= %s %s , starting to use proxy' % (response.status, request.url))
        import importlib
        proxy = ProxyMiddleware(settings=settings)
        request.meta['proxy'] = proxy.proxy_server
        request.headers['Proxy-Authorization'] = proxy.proxy_authorization
        return request
    else:
        return response
```

In theory, every response with status 403 should keep going back out through the proxy until the status is no longer 403. What actually happens when it runs: [screenshot missing]
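A request returned from `process_response` is rescheduled, and the scheduler's dupefilter sees a fingerprint identical to the request that already ran, so every retry after the first is silently dropped, which matches the symptom exactly. Copying the request and marking it `dont_filter` (the same thing Scrapy's own RetryMiddleware does) should fix it; a sketch reusing the names from the code above:

```python
def process_response(self, request, response, spider):
    if response.status == 403:
        spider.logger.debug('403 for %s, retrying through proxy', request.url)
        proxy = ProxyMiddleware(settings=settings)  # as in the question's code
        retry = request.copy()
        retry.meta['proxy'] = proxy.proxy_server
        retry.headers['Proxy-Authorization'] = proxy.proxy_authorization
        # Without this flag the rescheduled request has the same fingerprint
        # as the one that already ran, so the dupefilter drops it silently.
        retry.dont_filter = True
        return retry
    return response
```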
Scrapy crawl runs into "Filtered duplicate request"
When requesting the site http://bigfile.co.kr with Scrapy, it prints a "Filtered duplicate request: no more duplicates" message and then just stops. After adding dont_filter=True and re-running, it loops forever instead: it never finishes and never scrapes anything. Could someone take a look?
```python
name = 'WebSpider'
start_urls = ['http://bigfile.co.kr']
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    'Referer': 'http://www.baidu.com/',
    "Upgrade-Insecure-Requests": 1,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}

def start_requests(self):
    request = scrapy.Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)
    request.meta['url'] = self.start_urls[0]
    yield request
```
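"Filtered duplicate request" just means the dupefilter saw a second request with the same fingerprint, typically the site redirecting back to an already-visited URL, and `dont_filter=True` turns that same cycle into an endless loop. Before changing anything else, it may help to see exactly which request keeps coming back; a one-line settings sketch:

```python
# settings.py - log every filtered duplicate (by default only the first
# one is logged), to see exactly which URL keeps being re-scheduled.
DUPEFILTER_DEBUG = True
```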
Garbled text when crawling the Zhihu homepage with Scrapy
Crawling the Zhihu homepage, response.text comes back garbled; decoding response.body by hand also yields garbage, and I can't figure out why. Code:

```python
import scrapy

HEADERS = {
    'Host': 'www.zhihu.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Origin': 'https://www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=HEADERS)

    def parse(self, response):
        print('========== parse ==========')
        print(response.text[:100])
        body = response.body
        encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1']
        for encoding in encodings:
            try:
                print('========== decode ' + encoding)
                print(body.decode(encoding)[:100])
                print('========== decode end\n')
            except Exception as e:
                print('########## decode {0}, error: {1}\n'.format(encoding, e))
                pass
```

The log output:

```
D:\workspace_python\ZhihuSpider>scrapy crawl zhihu
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider)
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']}
2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened
2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: https://www.zhihu.com/)
========== parse ==========
��~!���#5���=B���_��^��ˆ� ═4�� 1���J�╗%Xi��/{�vH�"�� z�I�zLgü^�1� Q)Ա�_k}�䄍���/T����U�3���l���
========== decode utf-8
########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte
========== decode gbk
########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode gb2312
########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode iso-8859-1
áø~!¢
```

The very same code has no problem at all when pointed at douban. I've searched everywhere without finding a fix, so I'm asking here. If the crawler can't be sorted out, the Zhihu clone backend I built will have no data to show, and I'm really anxious. With fewer than 5 C-coins left I can't post a bounty, but I genuinely need help.
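The log is consistent with a body that was never decompressed: the headers advertise `Accept-Encoding: gzip, deflate, br`, Zhihu answers with Brotli (`br`), and Scrapy 1.4's HttpCompressionMiddleware cannot decode Brotli unless the `brotli` package is installed, while Douban, answering with gzip, works fine. Dropping `br` from the advertised encodings should be enough; a sketch of the one header to change:

```python
HEADERS = {
    'Host': 'www.zhihu.com',
    # Advertise only encodings this Scrapy version can actually decode;
    # with 'br' included, Zhihu responds Brotli-compressed and the body
    # is never decompressed, which is exactly the garbage seen above.
    'Accept-Encoding': 'gzip, deflate',
    # ... keep the remaining headers unchanged ...
}
```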
A Baidu Muzhi Doctor crawler: the first step is to collect every link for a given question, but nothing gets scraped. Could someone take a look at why?
#Preface
This crawler is meant to collect every Baidu Muzhi Doctor link about "咳嗽" (cough); the next step would be to scrape the item fields from each collected link, but I'm stuck at step one. I posted this question once before, but the post has somehow disappeared (deleted, apparently?). Please help a beginner out, much appreciated! This is the structure of my crawler: ![图片说明](https://img-ask.csdn.net/upload/201911/27/1574787999_274479.png)

##ks:
```python
# -*- coding: utf-8 -*-
import scrapy
from kesou.items import KesouItem
from scrapy.selector import Selector
from scrapy.spiders import Spider
from scrapy.http import Request, FormRequest
import pymongo

class KsSpider(scrapy.Spider):
    name = 'ks'
    allowed_domains = ['kesou,baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M']

    def parse(self, response):
        item = KesouItem()
        contents = response.xpath('.//h3[@class="t"]')
        for content in contents:
            url = content.xpath('.//a/@href').extract()[0]
            item['url'] = url
            yield item
        if self.offset < 760:
            self.offset += 10
            yield scrapy.Request(url="https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=" + str(self.offset) + "&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M", callback=self.parse, dont_filter=True)
```

##items:
```python
# -*- coding: utf-8 -*-
import scrapy

class KesouItem(scrapy.Item):
    question_ID = scrapy.Field()     # question ID
    question = scrapy.Field()        # question text
    answer_pubtime = scrapy.Field()  # time the doctor's answer was posted
    description = scrapy.Field()     # question details
    doctor_name = scrapy.Field()     # doctor's name
    doctor_title = scrapy.Field()    # doctor's title
    hospital = scrapy.Field()        # doctor's hospital
```

##middlewares:
```python
# -*- coding: utf-8 -*-
from scrapy import signals

class KesouSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Should return either None or an iterable of Request, dict or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class KesouDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Must either return None, a Response, a Request, or raise IgnoreRequest.
        return None

    def process_response(self, request, response, spider):
        # Must either return a Response, a Request, or raise IgnoreRequest.
        return response

    def process_exception(self, request, exception, spider):
        # Must either return None, a Response object, or a Request object.
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```

##pipelines:
```python
# -*- coding: utf-8 -*-
import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class KesouPipeline(object):
    def __init__(self):
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        sheetname = settings["MONGODB_SHEETNAME"]
        # create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mydb = client[dbname]
        # collection the data is stored in
        self.sheet = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.sheet.insert(data)
        return item
```

##settings (active values; the generated template comments are unchanged defaults):
```python
# -*- coding: utf-8 -*-
BOT_NAME = 'kesou'

SPIDER_MODULES = ['kesou.spiders']
NEWSPIDER_MODULE = 'kesou.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0"

# Configure item pipelines
ITEM_PIPELINES = {
    'kesou.pipelines.KesouPipeline': 300,
}

# MongoDB host
MONGODB_HOST = "127.0.0.1"
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = "ks"
# collection the data goes into
MONGODB_SHEETNAME = "ks_urls"
```

##run.py:
```python
# -*- coding: utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl ks".split())
```

##And this is the run output:
```
PS D:\scrapy_project\kesou> scrapy crawl ks
2019-11-27 00:14:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: kesou)
2019-11-27 00:14:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.18362-SP0
2019-11-27 00:14:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'kesou', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'kesou.spiders', 'SPIDER_MODULES': ['kesou.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0'}
2019-11-27 00:14:17 [scrapy.extensions.telnet] INFO: Telnet Password: 051629c46f34abdf
2019-11-27 00:14:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled item pipelines: ['kesou.pipelines.KesouPipeline']
2019-11-27 00:14:19 [scrapy.core.engine] INFO: Spider opened
2019-11-27 00:14:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-27 00:14:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-27 00:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%> (referer: None)
2019-11-27 00:14:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFFSYX%2B1M> (referer: None)
Traceback (most recent call last):
  File "d:\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\scrapy_project\kesou\kesou\spiders\ks.py", line 19, in parse
    item['url'] = url
  File "d:\anaconda3\lib\site-packages\scrapy\item.py", line 73, in __setitem__
    (self.__class__.__name__, key))
KeyError: 'KesouItem does not support field: url'
2019-11-27 00:14:20 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 438,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 68368,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.992207,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/KeyError': 1,
 'start_time': datetime.datetime(2019, 11, 26, 16, 14, 19, 863597)}
2019-11-27 00:14:21 [scrapy.core.engine] INFO: Spider closed (finished)
```
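The traceback names the problem directly: `parse()` stores `item['url']`, but `KesouItem` declares no `url` field. A second latent bug is that `self.offset` is read and incremented without ever being initialized. A sketch of the minimal fix, keeping the names from the code above:

```python
import scrapy

# items.py - parse() stores item['url'], so the item must declare it
# (the posted KesouItem does not, hence the KeyError in the log).
class KesouItem(scrapy.Item):
    url = scrapy.Field()
    # ... keep the other declared fields (question_ID, question, ...) ...

# ks.py - self.offset is read before it is ever assigned; give it a
# class-level starting value.
class KsSpider(scrapy.Spider):
    name = 'ks'
    offset = 0
```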
Simulating a Zhihu login with Scrapy returns "login too frequent, please retry later"
I previously simulated the login with requests, and the page saved locally just kept refreshing itself. After logging in with Scrapy, I get a 200 together with "登录过于频繁,请稍后重试" ("login too frequent, please retry later"). I've been stuck on this for days; any help from a dalao is appreciated. Code:

```python
# -*- coding: utf-8 -*-
import scrapy
import json
import re

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    agent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"
    header = {
        "HOST": "www.zhihu.com",
        "REFERER": "https://www.zhihu.com",
        "User-Agent": agent,
        "Connection": "Keep-Alive"
    }

    def parse(self, response):
        pass

    def start_requests(self):
        return [scrapy.Request("https://www.zhihu.com/signin?next=/", headers=self.header, callback=self.login)]

    def login(self, response):
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)
        xsrf = ''
        if match_obj:
            xsrf = match_obj.group(1)
        if xsrf:
            post_url = "https://www.zhihu.com/login/phone_num"
            post_date = {
                "_xsrf": xsrf,
                "password": "13247161221",
                "phone_num": "yun10791023",
                'captcha_type': 'cn',
                'remember_me': 'true',
            }
            return [scrapy.FormRequest(
                url=post_url,
                formdata=post_date,
                headers=self.header,
                callback=self.checklogin
            )]

    def checklogin(self, response):
        # inspect the server's reply to decide whether the login succeeded
        text_json = json.loads(response.text)
        # a breakpoint here shows: login too frequent, please retry later
        if "msg" in text_json and text_json["msg"] == "登陆成功":
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=False, headers=self.header)
```
Crawling Sina news content with Python: why is the stock folder empty after the run?
```python
#! /usr/bin/env python
# coding=utf-8
from scrapy.selector import Selector
from scrapy.http import Request
import re, os
from bs4 import BeautifulSoup
from scrapy.spider import Spider
import urllib2, thread

# deal with encoding issues
import sys
reload(sys)
sys.setdefaultencoding('gb18030')

# flag ensures that individual news pages are not scraped on the first pass
flag = 1
projectpath = 'C:\\Users\\DELL\\Desktop\\pythonproject\\mypro\\'

def loop(*response):
    sel = Selector(response[0])
    # get title
    title = sel.xpath('//h1/text()').extract()
    # get pages
    pages = sel.xpath('//div[@id="artibody"]//p/text()').extract()
    # get chanel_id & comment_id
    s = sel.xpath('//meta[@name="comment"]').extract()
    # comment_id = channel[index+3:index+15]
    index2 = len(response[0].url)
    news_id = response[0].url[index2-14:index2-6]
    comment_id = '31-1-' + news_id
    # every comment ends up in this list
    cmntlist = []
    page = 1
    # the file holding the news url, title, body and comments
    file2 = None
    # set to False when the news item has comments from non-mobile users
    is_all_tel = True
    while (page == 1) or (cmntlist != []):
        tel_count = 0  # per-page tel_user_count
        # the comment url to fetch
        url = "http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=cj&newsid=" + str(comment_id) + "&group=0&compress=1&ie=gbk&oe=gbk&page=" + str(page) + "&page_size=100"
        url_contain = urllib2.urlopen(url).read()
        b = '={'
        after = url_contain[url_contain.index(b)+len(b)-1:]
        # 'null' in the payload corresponds to Python's None; eval fails otherwise
        after = after.replace('null', 'None')
        # convert to a dict
        text = eval(after)
        if 'cmntlist' in text['result']:
            cmntlist = text['result']['cmntlist']
        else:
            cmntlist = []
        if cmntlist != [] and (page == 1):
            filename = str(comment_id) + '.txt'
            path = projectpath + 'stock\\' + filename
            file2 = open(path, 'a+')
            news_content = str('')
            for p in pages:
                news_content = news_content + p + '\n'
            item = "<url>" + response[0].url + "</url>" + '\n\n' + "<title>" + str(title[0]) + "</title>\n\n" + "<content>\n" + str(news_content) + "</content>\n\n<comment>\n"
            file2.write(item)
        if cmntlist != []:
            content = ''
            for status_dic in cmntlist:
                if status_dic['uid'] != '0':
                    is_all_tel = False
                    # encoding-dependent; dropping the decode/encode here also works
                    s = status_dic['content'].decode('UTF-8').encode('GBK')
                    # see "three pictures" in another blog post of mine
                    s = s.replace("'", '"')
                    s = s.replace("\n", '')
                    s1 = "u'" + s + "'"
                    try:
                        ss = eval(s1)
                    except:
                        try:
                            s1 = 'u"' + s + '"'
                            ss = eval(s1)
                        except:
                            return
                    content = content + status_dic['time'] + '\t' + status_dic['uid'] + '\t' + ss + '\n'
                # mobile user
                else:
                    tel_count = tel_count + 1
            # when not every comment on the page is from a mobile user;
            # is_all_tel could be used here too, one way is a switch, the
            # other is counting - leaving it as is
            if tel_count != len(cmntlist):
                file2.write(content)
        page = page + 1
    # while loop ends here
    if file2 != None:
        # all mobile users: remove the file; otherwise close with "</comment>"
        if is_all_tel:
            file2.close()
            try:
                os.remove(file2.name)
            except WindowsError:
                pass
        else:
            file2.write("</comment>")
            file2.close()

class DmozSpider(Spider):
    name = "stock"
    allowed_domains = ["sina.com.cn"]
    # start_urls barely matters here, since it is never really parsed
    start_urls = ["http://news.sina.com.cn/"]
    global projectpath
    if os.path.exists(projectpath + 'stock'):
        pass
    else:
        os.mkdir(projectpath + 'stock')

    def parse(self, response):
        # scrapy.selector.Selector is a decent string-handling class; Python is
        # strict about encodings, but Selector copes well. While writing this
        # crawler I hit many weird encoding problems (mostly Chinese text) and
        # tried many existing classes; even BeautifulSoup didn't handle them well.
        sel = Selector(response)
        global flag
        if flag == 1:
            flag = 2
            page = 1
            while page < 260:
                url = "http://roll.finance.sina.com.cn/finance/zq1/index_"
                url = url + str(page) + ".shtml"
                # pose as a browser
                user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
                headers = {'User-Agent': user_agent}
                req = urllib2.Request(url, headers=headers)
                response = urllib2.urlopen(req)
                url_contain = response.read()
                # parse the document with BeautifulSoup
                soup = BeautifulSoup(url_contain)
                params = soup.findAll('div', {'class': 'listBlk'})
                if os.path.exists(projectpath + 'stock\\' + 'link'):
                    pass
                else:
                    os.mkdir(projectpath + 'stock\\' + 'link')
                filename = 'link.txt'
                path = projectpath + 'stock\\link\\' + filename
                filelink = open(path, 'a+')
                for params_item in params:
                    persons = params_item.findAll('li')
                    for item in persons:
                        href = item.find('a')
                        mil_link = href.get('href')
                        filelink.write(str(mil_link) + '\n')
                        # recurse into parse with the newly found url
                        yield Request(mil_link, callback=self.parse)
                page = page + 1
        # scrape individual news pages on a new thread
        if flag != 1:
            if (response.status != 404) and (response.status != 502):
                thread.start_new_thread(loop, (response,))
```
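One thing worth ruling out, as a guess from the code rather than a certain diagnosis: `thread.start_new_thread` workers are killed abruptly when the process exits and their exceptions never reach any log, so `loop()` may be dying (or never finishing its writes) invisibly; on top of that, files are only created for articles that have non-mobile comments. Routing the article pages through a normal Scrapy callback keeps the work inside the crawl's lifetime; a sketch reusing the names above:

```python
# A sketch, not the original code: let Scrapy schedule the per-article
# work instead of thread.start_new_thread.
def parse(self, response):
    # ... collect mil_link as above ...
    yield Request(mil_link, callback=self.parse_article)

def parse_article(self, response):
    # Runs inside the crawl's lifetime, so file writes complete and
    # errors surface in Scrapy's log.
    loop(response)  # loop(*response) unpacks this back to response[0]
```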
How to solve Lagou's 302 problem? Looking for expert guidance
While scraping job postings from Lagou, after crawling for a while I start getting 302 redirects ![图片说明](https://img-ask.csdn.net/upload/201910/16/1571190330_190767.png) which, on inspection, point at the login page ![图片说明](https://img-ask.csdn.net/upload/201910/16/1571190672_952436.png) I thought logging in would solve it, but it isn't that simple: even after logging in, the 302s still come back. Please help me figure this out!! My settings:

```python
BOT_NAME = 'LagouSpider'
SPIDER_MODULES = ['LagouSpider.spiders']
NEWSPIDER_MODULE = 'LagouSpider.spiders'

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 2
DOWNLOAD_DELAY = 3

# forbid redirects
COOKIES_ENABLED = False
REDIRECT_ENABLED = False

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/',
}

DOWNLOADER_MIDDLEWARES = {
    # 'LagouSpider.middlewares.LagouspiderDownloaderMiddleware': 543,
    'LagouSpider.middlewares.RandomUserAgentMiddleware': 100,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'LagouSpider.middlewares.LagoucrawlerDownloaderMiddleware': 543,
}
```
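A guess grounded in the settings shown: with `COOKIES_ENABLED = False` the spider never keeps the session cookies Lagou issues, and Lagou is known to redirect cookie-less clients to the login page regardless of whether you logged in from a browser. Re-enabling cookies (and letting redirects happen so the session can be established) is the first thing to try:

```python
# settings.py - keep Lagou's session cookies between requests; without
# them the site eventually 302s every request to the login page.
COOKIES_ENABLED = True
REDIRECT_ENABLED = True
```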