A question about Request and yield in Scrapy — any help appreciated

I need to crawl several metrics for a given page, and one of them is whether the site has a robots.txt file. My way of checking it is to request 'www.baidu.com/robots.txt' and decide based on the response status code. The trouble is that I want to put this metric into the same item as the other metrics — how should I do that?

I originally wanted to do it with yield scrapy.Request('www.baidu.com/robots.txt'), but I can't get at its return value, and yield and return (with a value) can't be mixed in the same callback. Hoping someone can explain — many thanks.

My QQ is 642026725 — any guidance is welcome; this newbie would be very grateful.
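A minimal sketch of one common pattern for this, rather than a definitive answer: instead of trying to read a "return value" from yield scrapy.Request, chain a second request and carry the half-filled item along in request.meta, then fill in the robots.txt flag in the second callback and yield the finished item there. The spider name, start URL and the other_metric field below are made up purely for illustration.

```python
import scrapy


class MetricsSpider(scrapy.Spider):
    name = "metrics"                        # hypothetical spider name
    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        item = {"other_metric": len(response.body)}   # stand-in for the real metrics
        # Chain a second request for robots.txt and pass the partial item
        # along in meta; handle_httpstatus_all lets a 404 reach the callback
        # instead of being swallowed by HttpErrorMiddleware.
        yield scrapy.Request(
            response.urljoin("/robots.txt"),
            callback=self.parse_robots,
            meta={"item": item, "handle_httpstatus_all": True},
            dont_filter=True,
        )

    def parse_robots(self, response):
        item = response.meta["item"]
        item["has_robots_txt"] = response.status == 200
        yield item                          # one item, both metrics filled in
```

A callback never "returns" a value to the caller; yielding the item from the last callback in the chain is the usual Scrapy way of producing it.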

Other related questions
Scrapy yield Request silently not working

I want to crawl game info and game reviews. The info and the reviews live on separate pages, so I use two callbacks. In parse, using yield to go into the second callback and to call back into parse itself both work fine. In parse_two, the yield with a callback does nothing — no error is raised, the request simply never runs.
```
def parse(self, response):
    # print response.body
    selector = scrapy.Selector(response)
    games = selector.xpath('//div[@class="app-item-caption"]/a[@class="item-caption-title flex-text-overflow"]/@href').extract()
    for game in games:
        game = game + '/review'
        yield scrapy.http.Request(game, callback=self.parse_two)
        # print game
    # next page of the game list
    nextPage = selector.xpath('//ul[@class="pagination"]/li[last()]/a/@href').extract()
    if nextPage:
        next = nextPage[0]
        # print next
        yield scrapy.http.Request(next, callback=self.parse)

def parse_two(self, response):
    Gid = response.url[27:32]
    Gid = int(Gid)
    selector = scrapy.Selector(response)
    game_review_times = selector.xpath('//a[@class="text-header-time"]/span/@data-dynamic-time').extract()
    game_reviews = selector.xpath('//div[@class="review-item-text"]/div[@class="item-text-body"]').extract()
    game_reivew_author = selector.xpath('//span[@class="taptap-user"]/a/text()').extract()
    reviewNo = 1
    review_dict = {}
    # process the reviews
    for review in game_reviews:
        # count reviews per day
        # time_day = time.strftime('%Y-%m-%d', time.localtime(int(game_review_times[reviewNo - 1])))
        # if review_dict.get(time_day):
        #     review_dict[time_day] += 1
        # else:
        #     review_dict[time_day] = 1
        review_lines = re.findall('<p>(.*?)</p>', review, re.S)
        review = ''
        for line in review_lines:
            review += line
        item = TaptapItem()
        item['Review_GID'] = Gid
        item['Review_content'] = review
        item['Review_Author'] = game_reivew_author[reviewNo - 1]
        item['Reivew_Time'] = game_review_times[reviewNo - 1]
        yield item
        print 'Review %d:' % reviewNo
        print game_review_times[reviewNo - 1]
        print review
        reviewNo += 1
    # next page of the reviews
    nextPage = selector.xpath('//ul[@class="pagination"]/li[last()]/a/@href').extract()
    if nextPage:
        next = nextPage[0]
        # print next
        yield scrapy.http.Request(next, callback=self.parse_two)
```

scrapy Request keeps getting redirected

```
from scrapy.spider import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class Spider(CrawlSpider):
    name = 'wordSpider'
    NUM = 14220485
    start_urls = [
        "http://baike.baidu.com/view/1.htm"
    ]
    fi = open('e:/word.txt', 'w')
    cnt = 2

    def parse(self, response):
        selector = Selector(response)
        word = selector.xpath('body/div[@class="body-wrapper"]/div[@class="content-wrapper"]/div[@class="content"]/div[@class="main-content"]/dl/dd/h1/text()').extract_first()
        # word = selector.xpath('body/div[@id="J-lemma"]/div[@class="body-wrapper"]/div[@class="card-part"]/span[@class="lemma-title"]/text()').extract()
        self.fi.write(word + '\t' + 'n')
        if self.cnt <= self.NUM:
            wurl = "http://baike.baidu.com/view/%s.htm" % self.cnt
            self.cnt += 1
            yield Request(url=wurl, meta={}, callback=self.parse)
```
This is my spider source. How can I stop the 301/302 redirects? I'm trying to fetch every Baidu Baike entry, but the requests keep getting redirected, so I never get the pages I actually want.
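For what it's worth, a common way to keep Scrapy's RedirectMiddleware from following 301/302 responses is to flag the request in meta; the redirect response then reaches your callback, where you can inspect its status and Location header yourself. A hedged sketch based on the request in the code above:

```python
yield Request(
    url=wurl,
    callback=self.parse,
    meta={
        "dont_redirect": True,                 # RedirectMiddleware leaves this request alone
        "handle_httpstatus_list": [301, 302],  # let 301/302 responses reach parse()
    },
)
```

Inside parse() you can then check response.status and response.headers.get('Location') to decide what to do with the redirect target.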

Requests aren't being filtered, yet the scrapy Request callback is never called — desperately need advice!!!

The spider code looks like this:
```
def parse(self, response):
    url_list = response.xpath('//a/@href').extract()[0]
    for single_url in url_list:
        url = 'https:' + single_url.xpath('./@href').extract()[0]
        name = single_url.xpath('./text()').extract()[0]
        yield scrapy.Request(url=url, callback=self.parse_get, meta={'url': url, 'name': name})

def parse_get(self, response):
    print(1)
    item = MySpiderItem()
    item['name'] = response.mate['name']
    item['url'] = response.mate['url']
    yield item
```
The middleware code looks like this:
```
def process_request(self, request, spider):
    self.driver = webdriver.Chrome()
    self.driver.get(request.url)
    if 'anime' in request.meta:
        element = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.ID, 'header')))
    else:
        element = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.ID, 'header')))
    html = self.driver.page_source
    self.driver.quit()
    return scrapy.http.HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8')
```
I'm running it with Chrome. The URLs in the Requests do get opened one after another, but parse_get is never called. I never set allowed_domains, and I also tried dont_filter=True on the Request; the pages open fine, which suggests they are not being filtered out. I'm completely out of ideas — please help!!!!
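Two things in the posted spider look suspicious and are worth ruling out before blaming the middleware: extract()[0] turns url_list into a single string (so the for loop iterates characters and .xpath() would fail on them, showing up as "Spider error processing ..." lines in the log), and response.mate should be response.meta. A corrected sketch of the two callbacks, keeping everything else as in the question:

```python
def parse(self, response):
    # Iterate over the <a> elements themselves so each one can still be queried.
    for link in response.xpath('//a[@href]'):
        url = response.urljoin(link.xpath('./@href').extract_first())
        name = link.xpath('./text()').extract_first(default='')
        yield scrapy.Request(url=url, callback=self.parse_get,
                             meta={'url': url, 'name': name})

def parse_get(self, response):
    item = MySpiderItem()
    item['name'] = response.meta['name']   # meta, not "mate"
    item['url'] = response.meta['url']
    yield item
```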

How do I pass a variable from the Spider to the request inside a Middleware in Scrapy?

After extracting what I need from the response, I have to write part of it back into the cookie. But the response content is obtained in my custom parse function, while the cookie is updated in the middleware's process_request(). So how do I get a variable from the spider's parse function into process_request() in the middleware? Here is my function: ![screenshot](https://img-ask.csdn.net/upload/201906/03/1559527420_478414.png) Any pointers would be much appreciated~~
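One way that avoids globals: attach the value to request.meta when yielding the follow-up request in parse(), and read it back in process_request(); alternatively, store it on the spider instance, since process_request() receives the spider as an argument. A sketch under those assumptions — token, fresh_value, the XPath and the follow-up URL are hypothetical names, not taken from the screenshot:

```python
# spider.py — inside the custom parse():
def parse(self, response):
    token = response.xpath('//input[@name="token"]/@value').extract_first()  # hypothetical field
    yield scrapy.Request(
        response.urljoin('/next_page'),            # hypothetical follow-up URL
        callback=self.parse_detail,
        meta={'fresh_value': token},               # travels with this request
    )

# middlewares.py — inside the downloader middleware:
def process_request(self, request, spider):
    fresh_value = request.meta.get('fresh_value')
    if fresh_value:
        request.cookies['token'] = fresh_value     # update the outgoing cookie
    return None                                    # let the request continue normally
```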

How to keep header names from being capitalized when sending a Request in Python 3 Scrapy

When a Request goes out in Python 3 Scrapy, Scrapy normalizes the header names: the first letter gets capitalized, and so does the first letter after an underscore or other special character. My problem is that I have to send a header whose name must stay all lowercase; after Scrapy processes the request the name comes out capitalized and the server simply doesn't recognize the parameter. Does anyone know whether scrapy.Request has a way to leave header names untouched? When I send the same request with the requests library instead of scrapy.Request, the headers are not changed. ![headers before the request](https://img-ask.csdn.net/upload/201905/15/1557909540_468021.png) These are the headers before the request. ![captured request headers](https://img-ask.csdn.net/upload/201905/15/1557909657_878941.png) This is the request captured by the packet sniffer.

DEFAULT_REQUEST_HEADERS is already set in Scrapy's settings.py — how should headers be written when issuing a request?

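In case it helps: DEFAULT_REQUEST_HEADERS from settings.py is applied to every outgoing request by DefaultHeadersMiddleware, so it doesn't need to be repeated. Pass headers= on a Request only for the keys you want to add or override for that particular request — a small sketch (the Referer value is just an example):

```python
yield scrapy.Request(
    url,
    # Merged with DEFAULT_REQUEST_HEADERS; keys given here win on conflict.
    headers={"Referer": "https://www.example.com/"},
    callback=self.parse,
)
```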

Multi-level page crawling in Scrapy: a question about the order things run in

```
# -*- coding: utf-8 -*-
import scrapy
from SYDW.items import SydwItem

class DanweiCrawlingSpider(scrapy.Spider):
    # inherits from Spider
    name = 'danwei_crawling'
    allowed_domains = ['chinasydw.org']  # allowed domain
    start_urls = ['http://www.chinasydw.org']
    base_domain = 'http://www.chinasydw.org'

    def parse(self, response):
        province = response.xpath("//div[@class='fenzhan']//a/@href")
        for each_p in province:
            yield scrapy.Request(each_p.get(), callback=self.get_page)

    def get_page(self, response):
        for each in response.xpath("//div[@class='body']/ul[@class = 'list11 clearfix']/li[not(@class='ivl')]"):
            item = SydwItem()
            name = response.xpath("//div[@class='body']/ul[@class = 'list11 clearfix']/li[not(@class='ivl')]/a[not(@style)]/text()").get()
            time = response.xpath("//div[@class='body']/ul[@class = 'list11 clearfix']/li[not(@class='ivl')]/span[@class='time']/text()").get()
            link = response.xpath("//div[@class='body']/ul[@class = 'list11 clearfix']/li[not(@class='ivl')]/a[not(@style)]/@href").get()
            item['name'] = name
            item['time'] = time
            item['link'] = link
            yield item
        next_url = response.xpath("//div[@class='pageset']/a[last()]/@href").get()
        yield scrapy.Request(self.base_domain + next_url, callback=self.get_page, meta={'item': item})
```
The idea: parse collects the links of the regional sub-stations, then each sub-station is entered and every page of it is scraped.
The problem is the crawl order. I expected: enter a sub-station → crawl all of its pages → enter the next sub-station → crawl all of its pages. What actually happens is: enter a sub-station → crawl the current page → enter the next sub-station → crawl the current page, and only after every sub-station has been visited once does it move on to their second pages.
I'm new to Scrapy and would appreciate any help.
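Scrapy is asynchronous, so there is no strict guarantee about the order in which scheduled requests are downloaded, but the order can be biased. One lever is the priority argument of Request (a higher number is scheduled earlier), e.g. bumping the "next page" requests so one sub-station's pagination tends to finish before fresh sub-stations are started. A sketch based on the last lines of get_page above:

```python
next_url = response.xpath("//div[@class='pageset']/a[last()]/@href").get()
if next_url:
    yield scrapy.Request(
        self.base_domain + next_url,
        callback=self.get_page,
        priority=10,          # higher than the default 0 used for the sub-station links
        meta={'item': item},
    )
```

The opposite behaviour (finish one depth level everywhere before going deeper) is what the documented breadth-first settings — DEPTH_PRIORITY = 1 together with the FIFO scheduler queues — are for; for "finish one branch first" tendencies, per-request priority is the simpler knob.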

Merging content from multiple pages into one item in Python Scrapy

I've been learning Scrapy for a few months and can handle both plain Spider and CrawlSpider. Now I'm stuck on one problem: after crawling content that is split across several pages, how do I merge the content of all those pages into a single item field? Right now I yield Request into art_url to fetch each page, append the content to a list, and then set item['image_urls'] = self.art_urls to collect the result. But the list keeps accumulating, so every article ends up receiving the pages of all the others. How can I merge only the pages belonging to one article into one item? I've been learning for less than half a year and my code is messy, sorry. Mainly I want to learn how to crawl a novel site and merge all pages of each chapter together instead of producing lots of per-page records; suitable example code to study would be very welcome, thanks. My code:
```
art_urls = []

rules = (
    Rule(LinkExtractor(allow='wenzhang/', restrict_xpaths=('//table[@id="dlNews"]')), callback='parse_item', follow=True),
)

def parse_item(self, response):
    print(response.url)
    item = SpiderItem()
    conn = Redis(host='127.0.0.1', port=6379)
    item['title'] = response.xpath('//h1/text()').extract_first()
    ex = conn.sadd('movies_url', response.url)
    for next_href in response.xpath('//div[@class="pager"]/ul/li/a/@href').extract():
        next_url = self.base_url + next_href.replace('../', '')
        if ex == 1:
            # print('start parsing a single page')
            yield Request(next_url, callback=self.art_url)
            # yield scrapy.Request(url=next_url, callback=self.parse_detail, meta={'title': title, 'img_src': img_src})
        else:
            print("no new data!!!")
    # print(self.art_urls)
    item['image_urls'] = self.art_urls
    # print(len(item['image_urls']))
    # print(item)
    yield item

def art_url(self, response):
    art_urls = response.xpath('//div[@id="content"]/div/p/img/@src').extract()
    for art_url in art_urls:
        # parse one page
        art_url = art_url.replace('../../upload/', '')
        self.art_urls.append(art_url)
```
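A sketch of the usual pattern for this: keep the per-article list inside the item (not on the spider, where every article shares it), carry the item plus the list of pages still to fetch in request.meta, and only yield the item once the last page has been parsed. These are meant as methods of the same spider as in the question; the helper _next_page is not in the original code, and the Redis bookkeeping is left out for brevity:

```python
from scrapy import Request


def parse_item(self, response):
    item = SpiderItem()
    item['title'] = response.xpath('//h1/text()').extract_first()
    item['image_urls'] = []                      # per-article list, not self.art_urls
    pages = [self.base_url + href.replace('../', '')
             for href in response.xpath('//div[@class="pager"]/ul/li/a/@href').extract()]
    yield self._next_page(item, pages)

def _next_page(self, item, remaining):
    # Either request the next page, or emit the finished item when none are left.
    if remaining:
        return Request(remaining[0], callback=self.art_url, dont_filter=True,
                       meta={'item': item, 'remaining': remaining[1:]})
    return item

def art_url(self, response):
    item = response.meta['item']
    for src in response.xpath('//div[@id="content"]/div/p/img/@src').extract():
        item['image_urls'].append(src.replace('../../upload/', ''))
    yield self._next_page(item, response.meta['remaining'])
```

Because each request carries its own item and its own "remaining pages" list, the chapters no longer bleed into each other.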

Running a Scrapy spider raises Missing scheme in request url

I'm a complete Scrapy beginner playing with example code from the web — the hip-hop lyrics crawler from http://blog.csdn.net/czl389/article/details/77278166. That example has three spiders: songurls, lyrics and songinfo. The songurls spider crawls the URLs from Xiami Music and saves them into SongUrls.csv just fine, but running the lyrics spider gives the error below.

```
D:\xiami2\xiami2>scrapy crawl lyrics -o Lyrics.csv
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiami2)
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiami2.spiders', 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 4.0; Trident/3.0)', 'FEED_URI': 'Lyrics.csv', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.2, 'SPIDER_MODULES': ['xiami2.spiders'], 'BOT_NAME': 'xiami2'}
2017-10-21 21:13:29 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled item pipelines: ['xiami2.pipelines.Xiami2Pipeline']
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider opened
2017-10-21 21:13:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-21 21:13:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-21 21:13:31 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "d:\python3.5\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "d:\python3.5\lib\site-packages\scrapy\spiders\__init__.py", line 83, in start_requests
    yield Request(url, dont_filter=True)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-21 21:13:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 567323),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 536236)}
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider closed (finished)
```

------------------------------ divider ------------------------------

I had a look at the request __init__.py and found this check:

```
if ':' not in self._url:
    raise ValueError('Missing scheme in request url: %s' % self._url)
```

I've tried a few of the fixes floating around online, but none of them solved my problem, so I'm asking here — any pointers would be appreciated (I really have no C-coins left). I don't post questions often, so please forgive any formatting shortcomings. The code is attached below.

```
#songurls.py
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from ..items import SongUrlItem

class SongurlsSpider(scrapy.Spider):
    name = 'songurls'
    allowed_domains = ['xiami.com']

    # put the links page/1 through page/401 into start_urls
    start_url_list = []
    url_fixed = 'http://www.xiami.com/song/tag/Hip-Hop/page/'
    # widen the range to 1-401 to get every page
    for i in range(1, 402):
        start_url_list.extend([url_fixed + str(i)])
    start_urls = start_url_list

    def parse(self, response):
        urls = response.xpath('//*[@id="wrapper"]/div[2]/div/div/div[2]/table/tbody/tr/td[2]/a[1]/@href').extract()
        for url in urls:
            song_url = response.urljoin(url)
            url_item = SongUrlItem()
            url_item['song_url'] = song_url
            yield url_item
```

------------------------------ divider ------------------------------

```
#lyrics.py
import scrapy
import re

class LyricsSpider(scrapy.Spider):
    name = 'lyrics'
    allowed_domains = ['xiami.com']
    song_url_file = 'SongUrls.csv'

    def __init__(self, *args, **kwargs):
        # read all song urls from the song_url.csv file
        f = open(self.song_url_file, "r")
        lines = f.readlines()
        # line[:-1] strips the trailing newline from each line
        # lines[1:] skips the first csv line, which holds the field names
        song_url_list = [line[:-1] for line in lines[1:]]
        f.close()
        while '\n' in song_url_list:
            song_url_list.remove('\n')
        self.start_urls = song_url_list  # [:100]  # remove [:100] to crawl everything

    def parse(self, response):
        lyric_lines = response.xpath('//*[@id="lrc"]/div[1]/text()').extract()
        lyric = ''
        for lyric_line in lyric_lines:
            lyric += lyric_line
        # print lyric
        lyricItem = LyricItem()
        lyricItem['lyric'] = lyric
        lyricItem['song_url'] = response.url
        yield lyricItem
```

songinfo isn't used yet, so it doesn't matter here.

------------------------------ divider ------------------------------

```
#items.py
import scrapy

class SongUrlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    song_url = scrapy.Field()    # song link

class LyricItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    lyric = scrapy.Field()       # lyrics
    song_url = scrapy.Field()    # song link

class SongInfoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    song_url = scrapy.Field()    # song link
    song_title = scrapy.Field()  # song title
    album = scrapy.Field()       # album
    # singer = scrapy.Field()    # singer
    language = scrapy.Field()    # language
```

------------------------------ divider ------------------------------

A few lines were added under the middleware:

```
sleep_seconds = 0.2        # sleep after the simulated click to give the browser time to produce the response
default_sleep_seconds = 1  # sleep for requests with no action

def process_request(self, request, spider):
    spider.logger.info('--------Spider request processed: %s' % spider.name)
    page = None
    driver = webdriver.PhantomJS()
    spider.logger.info('--------request.url: %s' % request.url)
    driver.get(request.url)
    driver.implicitly_wait(0.2)
    # sleep briefly so the page finishes loading before its content is returned
    time.sleep(self.sleep_seconds)
    page = driver.page_source
    driver.close()
    return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
```

------------------------------ divider ------------------------------

A few lines were added or changed in settings:

```
from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

DOWNLOAD_DELAY = 0.2

DEFAULT_REQUEST_HEADERS = {
    'Host': 'www.xiami.com',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
}

ITEM_PIPELINES = {
    'xiami2.pipelines.Xiami2Pipeline': 300,
}
```
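Given that the ValueError shows an empty URL after the colon ("Missing scheme in request url: " with nothing following), the likely culprit is a blank or malformed line slipping through the hand-rolled CSV parsing in LyricsSpider.__init__. A sketch of a more defensive version using the csv module — it assumes the feed exporter wrote a song_url column, which matches the SongUrlItem field name:

```python
import csv
import scrapy


class LyricsSpider(scrapy.Spider):
    name = 'lyrics'
    allowed_domains = ['xiami.com']
    song_url_file = 'SongUrls.csv'

    def __init__(self, *args, **kwargs):
        super(LyricsSpider, self).__init__(*args, **kwargs)
        urls = []
        with open(self.song_url_file, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):          # DictReader consumes the header row
                url = (row.get('song_url') or '').strip()
                if url.startswith('http'):         # drop blanks and anything without a scheme
                    urls.append(url)
        self.start_urls = urls
```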

scrapy error: Missing scheme in request url: h

I wrote a Scrapy spider to download images from a page, and it errors out with: Missing scheme in request url: h. Baidu and Google both say it's a relative URL that needs to be made absolute, but urljoin didn't help, and using the full image URL directly didn't help either. Please help.
```python
import scrapy
from imageSpider.items import ImagespiderItem

class image_Spider(scrapy.Spider):
    name = "imgSpider"
    allowed_domains = ["image.baidu.com"]
    start_urls = ["http://image.baidu.com/"]

    def parse(self, response):
        oriList = response.xpath('//div[@class="img_pic_wrap_layer"]/img/@src').extract()
        for each in oriList:
            each = response.urljoin(each)
            item = ImagespiderItem()
            item['image_urls'] = each
            yield item
```
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class ImagespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
```
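The message "Missing scheme in request url: h" is the classic symptom of handing ImagesPipeline a single string instead of a list: image_urls is expected to be a list of absolute URLs, and when it's a string the pipeline iterates it character by character, so the first "URL" it sees is the letter h. A sketch of the parse method with that one change:

```python
def parse(self, response):
    ori_list = response.xpath('//div[@class="img_pic_wrap_layer"]/img/@src').extract()
    item = ImagespiderItem()
    # ImagesPipeline wants a *list* of absolute URLs here, not one string.
    item['image_urls'] = [response.urljoin(src) for src in ori_list]
    yield item
```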

Scrapy crawl runs into Filtered duplicate

When requesting the site http://bigfile.co.kr with Scrapy, it prints "Filtered duplicate request: no more duplicates" and then just finishes. After adding dont_filter=True and rerunning, it loops forever instead — it never finishes and never scrapes anything. Could someone take a look?
```python
name = 'WebSpider'
start_urls = ['http://bigfile.co.kr']
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    'Referer': 'http://www.baidu.com/',
    "Upgrade-Insecure-Requests": 1,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}

def start_requests(self):
    request = scrapy.Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)
    request.meta['url'] = self.start_urls[0]
    yield request
```
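That combination of symptoms (filtered as a duplicate without dont_filter, an endless loop with it) often means the site keeps redirecting back to an already-seen URL. Before reaching for dont_filter, it can help to make the dupefilter and the logger tell you exactly what is being dropped — a settings.py sketch:

```python
# settings.py
# Log every request the RFPDupeFilter discards (not just the first one),
# so you can see which URL keeps bouncing back — often a redirect loop.
DUPEFILTER_DEBUG = True
LOG_LEVEL = 'DEBUG'      # keep redirect decisions visible in the log
```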

How should the meta argument of Scrapy's FormRequest be set?

I'm crawling with Scrapy; the parse function hands part of the work to a further callback. The code:
```
item = SoccerDataItem()
for i in range(1, 8):
    item['player' + str(i + 1)] = players[i]
for j in range(1, 8):
    home_sub_list = response.xpath('//div[@class="left"]//li[@class="pl10"]')
    if home_sub_list[j - 1].xpath('./span/img[contains(@src,"subs_up")]'):
        item['player' + str(j)]['name'] = home_sub_list[j - 1].xpath('./div[@class="ml10"]').xpath('string(.)').re_first('\d{1,2}\xa0\xa0(.*)')
        item['player' + str(j)]['team_stand'] = 1
        item['player' + str(j)]['is_startup'] = 0
        item['player' + str(j)]['is_subs_up'] = 1
        item['player' + str(j)]['subs_up_time'] = home_sub_list[j].xpath('./span/img[contains(@src,"subs_up")]/following-sibling::span').xpath('string(.)').extract_first(default='')
        yield scrapy.FormRequest(url=data_site, formdata=formdata, meta={'player': item['player' + str(j)]}, callback=self.parse_data)
    else:
        item['player' + str(j)]['name'] = home_sub_list[j - 1].xpath('./div[@class="ml10"]').xpath('string(.)').re_first('\d{1,2}\xa0\xa0(.*)')
        item['player' + str(j)]['team_stand'] = 1
        item['player' + str(j)]['is_startup'] = 0
        item['player' + str(j)]['is_subs_up'] = 0
```
But it keeps throwing this error when it runs:
```
callback=self.parse_data)
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 66, in _urlencode
    for k, vs in seq
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 67, in <listcomp>
    for v in (vs if is_listlike(vs) else [vs])]
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\utils\python.py", line 119, in to_bytes
    'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got int
```
From what I found on Baidu, the values in meta are supposed to be strings or bytes, which would explain the error when I pass in a dict. But how should I change this part? PS: the language is Python; apologies if the formatting is hard on the eyes!
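Reading the traceback, the failure is actually in FormRequest's URL-encoding of formdata (form.py → _urlencode → to_bytes), not in meta: meta can carry arbitrary Python objects, dicts included, but every key and value in formdata must be a string. A hedged sketch of the fix, assuming formdata currently contains ints:

```python
# Stringify the form fields before building the request; meta can stay as it is.
formdata_str = {k: str(v) for k, v in formdata.items()}
yield scrapy.FormRequest(
    url=data_site,
    formdata=formdata_str,
    meta={'player': item['player' + str(j)]},   # a dict in meta is perfectly fine
    callback=self.parse_data,
)
```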

Garbled text when crawling the Zhihu front page with Scrapy

Crawling the Zhihu front page, the returned response.text is garbled. I tried decoding response.body by hand and still get garbage, and I don't know why. The code:
```
import scrapy

HEADERS = {
    'Host': 'www.zhihu.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Origin': 'https://www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=HEADERS)

    def parse(self, response):
        print('========== parse ==========')
        print(response.text[:100])
        body = response.body
        encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1']
        for encoding in encodings:
            try:
                print('========== decode ' + encoding)
                print(body.decode(encoding)[:100])
                print('========== decode end\n')
            except Exception as e:
                print('########## decode {0}, error: {1}\n'.format(encoding, e))
                pass
```
The log output:
```
D:\workspace_python\ZhihuSpider>scrapy crawl zhihu
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider)
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']}
2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened
2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: https://www.zhihu.com/)
========== parse ==========
��~!���#5���=B���_��^��ˆ� ═4�� 1���J�╗%Xi��/{�vH�"�� z�I�zLgü^�1� Q)Ա�_k}�䄍���/T����U�3���l���
========== decode utf-8
########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte
========== decode gbk
########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode gb2312
########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode iso-8859-1
áø~!¢
```
With exactly the same code, switching the target site to douban gives no problem at all. I've searched Baidu everywhere and found no solution, so I'm asking here. If I can't get this crawler working, the Zhihu-style backend I'm building will have no data to show, and I'm really anxious. I have fewer than 5 C-coins left so I can't offer a bounty, but I genuinely need help.
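One likely explanation, given the headers in the code above: Accept-Encoding advertises br, and Zhihu answers with a Brotli-compressed body. Scrapy 1.4's HttpCompressionMiddleware can decompress gzip and deflate but not Brotli, so response.body stays compressed and no text codec will ever make sense of it (Douban presumably just doesn't answer with br). A minimal change is to stop advertising br:

```python
HEADERS = {
    'Host': 'www.zhihu.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',    # drop "br": this Scrapy version cannot decode Brotli
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
}
```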

Scrapy configuration problem — please help

I configured Scrapy following http://blog.csdn.net/wukaibo1986/article/details/8167590. Creating a project works, but running it throws the error below. The demo follows http://www.oschina.net/translate/scrapy-demo. Can anyone explain this?
```
E:\爬虫\tutorial>scrapy crawl dmoz
2013-11-20 11:09:50+0800 [scrapy] INFO: Scrapy 0.20.0 started (bot: tutorial)
2013-11-20 11:09:50+0800 [scrapy] DEBUG: Optional features available: ssl, http11
2013-11-20 11:09:50+0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2013-11-20 11:09:50+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 168, in <module>
    execute()
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\commands\crawl.py", line 50, in run
    self.crawler_process.start()
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 92, in start
    if self.start_crawling():
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 124, in start_crawling
    return self._start_crawler() is not None
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 139, in _start_crawler
    crawler.configure()
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 47, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\engine.py", line 63, in __init__
    self.downloader = Downloader(crawler)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\__init__.py", line 73, in __init__
    self.handlers = DownloadHandlers(crawler)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 18, in __init__
    cls = load_object(clspath)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\utils\misc.py", line 40, in load_object
    mod = import_module(module)
  File "C:\Python27\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\s3.py", line 4, in <module>
    from .http import HTTPDownloadHandler
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\http.py", line 5, in <module>
    from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\http11.py", line 17, in <module>
    from scrapy.responsetypes import responsetypes
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\responsetypes.py", line 113, in <module>
    responsetypes = ResponseTypes()
  File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\responsetypes.py", line 34, in __init__
    self.mimetypes = MimeTypes()
  File "C:\Python27\lib\mimetypes.py", line 66, in __init__
    init()
  File "C:\Python27\lib\mimetypes.py", line 358, in init
    db.read_windows_registry()
  File "C:\Python27\lib\mimetypes.py", line 258, in read_windows_registry
    for subkeyname in enum_types(hkcr):
  File "C:\Python27\lib\mimetypes.py", line 249, in enum_types
    ctype = ctype.encode(default_encoding) # omit in 3.x!
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 9: ordinal not in range(128)
```

Isn't Scrapy supposed to handle cookies automatically? Why doesn't the Request I send with Scrapy carry any cookie information?

I set COOKIES_ENABLED = True and COOKIES_DEBUG = True in the settings. Isn't Scrapy supposed to handle cookies automatically? Why doesn't the Request I send with Scrapy carry any cookie information?
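Worth noting: COOKIES_ENABLED = True only makes Scrapy remember and resend cookies that responses *within the same crawl* have set; it does not pick up cookies from a browser or from anywhere else. If the site needs an existing session, the cookies have to be seeded explicitly on the first request — a sketch with placeholder values:

```python
def start_requests(self):
    # Hypothetical cookie values copied from a logged-in browser session.
    cookies = {'sessionid': 'xxxxxxxx', 'csrftoken': 'yyyyyyyy'}
    for url in self.start_urls:
        yield scrapy.Request(url, cookies=cookies, callback=self.parse)
```

With COOKIES_DEBUG = True you should then see "Sending cookies to: ..." lines in the log once the cookie jar has something in it.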

Passing a value from PyQt5 into Scrapy

I'm building a PyQt5 front end for my crawler, but the text from the lineEdit in the UI never reaches the spider (it crawls Weibo, so a search keyword has to be passed in). My approach is a global variable KEYWORD: the UI overwrites it from the lineEdit, then the spider is started and reads the modified KEYWORD. I've replaced the irrelevant functions with pass to make it easier to read. Why doesn't this work — is it because another thread is started, so the spider still sees the original default '关键字1'?

```
# -*- coding: utf-8 -*-
KEYWORD = '关键字1'

class Ui_Form(object):
    def setupUi(self, Form):
        Form.setObjectName("Form")
        Form.resize(769, 575)
        self.lineEdit = QLineEdit(Form)
        self.lineEdit.setGeometry(QRect(130, 50, 161, 21))
        self.lineEdit.setObjectName("lineEdit")
        self.label = QLabel(Form)
        self.label.setGeometry(QRect(30, 50, 91, 21))
        self.label.setObjectName("label")
        self.pushButton_2 = QPushButton(Form)
        self.pushButton_2.setGeometry(QRect(550, 40, 81, 41))
        self.pushButton_2.setObjectName("pushButton_2")
        self.pushButton_3 = QPushButton(Form)
        self.pushButton_3.setGeometry(QRect(330, 40, 81, 41))
        self.pushButton_3.setObjectName("pushButton_3")
        self.pushButton_4 = QPushButton(Form)
        self.pushButton_4.setGeometry(QRect(440, 40, 81, 41))
        self.pushButton_4.setObjectName("pushButton_4")
        self.pushButton_5 = QPushButton(Form)
        self.pushButton_5.setGeometry(QRect(660, 40, 81, 41))
        self.pushButton_5.setObjectName("pushButton_5")
        self.pushButton_4.clicked.connect(self.pop2)  # start the spider
        self.pushButton_2.clicked.connect(self.pop1)
        self.pushButton_3.clicked.connect(self.pop4)  # start the cookies pool and change the keyword
        self.pushButton_5.clicked.connect(self.pop5)
        self.tableView = QTableView(Form)
        self.tableView.setGeometry(QRect(15, 131, 731, 421))
        # set up the tableView
        self.model = QStandardItemModel(1, 6)
        self.model.setHorizontalHeaderLabels(['作者id', '评论数', '正文', '转发数', '点赞数', 'user'])
        self.tableView.setEditTriggers(QAbstractItemView.NoEditTriggers)  # read-only
        self.tableView.resizeColumnsToContents()  # size columns to their contents
        self.tableView.setModel(self.model)
        # tableView setup done
        self.tableView.setObjectName("tableView")
        self.label_2 = QLabel(Form)
        self.label_2.setGeometry(QRect(30, 110, 72, 15))
        self.label_2.setObjectName("label_2")
        self.retranslateUi(Form)
        QMetaObject.connectSlotsByName(Form)

    def retranslateUi(self, Form):
        _translate = QCoreApplication.translate
        Form.setWindowTitle(_translate("Form", "Form"))
        self.label.setText(_translate("Form", "输入关键字"))
        self.pushButton_2.setText(_translate("Form", "显示结果"))
        self.pushButton_3.setText(_translate("Form", "启动服务"))
        self.pushButton_4.setText(_translate("Form", "开始抓取"))
        self.pushButton_5.setText(_translate("Form", "结果分析"))
        self.label_2.setText(_translate("Form", "结果显示"))

    # slot functions
    def pop1(self):
        # show data from the database
        pass

    def pop2(self):
        # start the spider
        new.run()

    def pop3(self):
        # quit
        pass

    def pop4(self):
        # start the service; the keyword is changed here, e.g. to '关键字2'
        global KEYWORD
        KEYWORD = self.lineEdit.text()
        print(KEYWORD)  # this prints '关键字2', not '关键字1'
        s.start()

    def pop5(self):
        # show the results
        pass


if __name__ == '__main__':
    app = QApplication(sys.argv)
    MainWindow = QMainWindow()
    ui = Ui_Form()
    ui.setupUi(MainWindow)
    MainWindow.show()
    sys.exit(app.exec_())


# spider part
class WeiboSpider(Spider):
    client = pymongo.MongoClient(host='127.0.0.1', port=27017)
    db = client.weibo
    p = db.weibo
    name = 'weibo'
    allowed_domains = ["weibo.cn"]
    start_url = 'https://weibo.cn/search/mblog'
    max_page = 100
    count = 0

    def start_requests(self):
        global KEYWORD
        keyword = KEYWORD  # the modified keyword never shows up here
        print(keyword)     # still prints '关键字1'
        url = '{url}?keyword={keyword}'.format(url=self.start_url, keyword=keyword)
        for page in range(self.max_page + 1):
            data = {
                'mp': str(self.max_page),
                'page': str(page)
            }
            yield FormRequest(url, callback=self.parse_index, formdata=data)

    def parse_index(self, response):
        pass

    def comment_detail(self, response):
        pass
```

The contents of new.py:

```
from scrapy.crawler import CrawlerProcess
from weibosearch.spiders.weibo import WeiboSpider

def run():
    process = CrawlerProcess()
    process.crawl(WeiboSpider)
    process.start()
```
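A more robust pattern than a module-level global is to pass the keyword into the crawl as a spider argument: CrawlerProcess.crawl() forwards keyword arguments to the spider, and the base Spider sets them as instance attributes. A sketch under that assumption, keeping '关键字1' as the fallback:

```python
# new.py — pass the keyword into the crawl instead of reading a global
from scrapy.crawler import CrawlerProcess
from weibosearch.spiders.weibo import WeiboSpider

def run(keyword):
    process = CrawlerProcess()
    process.crawl(WeiboSpider, keyword=keyword)   # becomes self.keyword on the spider
    process.start()
```

In the GUI slot, call new.run(self.lineEdit.text()), and in the spider read the attribute instead of the global:

```python
def start_requests(self):
    keyword = getattr(self, 'keyword', '关键字1')   # falls back to the old default
    url = '{url}?keyword={keyword}'.format(url=self.start_url, keyword=keyword)
    for page in range(self.max_page + 1):
        data = {'mp': str(self.max_page), 'page': str(page)}
        yield FormRequest(url, callback=self.parse_index, formdata=data)
```

Separately, be aware that process.start() blocks and a Twisted reactor cannot be restarted in the same process, so launching the crawl from a button usually needs a subprocess or CrawlerRunner — that is a different issue from the keyword passing.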

Why can I crawl the page with plain requests but not with Scrapy?

```
class job51():
    def __init__(self):
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
            'Cookie': ''
        }

    def start(self):
        html = session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php", headers=self.headers)
        self.parse(html)

    def parse(self, response):
        tree = lxml.etree.HTML(response.text)
        resume_url = tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
        print(resume_url[0])
```
This gets the result I want, namely the resume URL. But with Scrapy, using the same headers, the page seems to stay stuck on the login page?
```
class job51(Spider):
    name = "job51"
    # allowed_domains = ["my.51job.com"]
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
        'Cookie': ''
    }

    def start_requests(self):
        yield Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)

    def parse(self, response):
        # tree = lxml.etree.HTML(text)
        selector = Selector(response)
        print("<<<<<<<<<<<<<<<<<<<<<", response.text)
        resume_url = selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
        print(">>>>>>>>>>>>", resume_url)
```
The output:
```
scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
    yield Request(resume_url[0],headers=self.headers,callback=self.getResume)
  File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5743,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
```
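Two Scrapy-specific things stand out in that log. First, the settings show ROBOTSTXT_OBEY: True, which plain requests does not have (robots.txt 404s here, so it isn't the blocker, but it is worth knowing about). Second, and more likely the cause: with the default COOKIES_ENABLED = True, recent Scrapy versions' CookiesMiddleware rebuilds the Cookie header from its own cookie jar and discards a hand-set one, and the jar is empty at this point, so the site bounces the request to the login page. Either disable the middleware or hand the cookies to the Request as a dict — a sketch with placeholder values:

```python
# Option 1 — settings.py: let the hand-written 'Cookie' header go through untouched.
COOKIES_ENABLED = False

# Option 2 — keep the middleware, but pass the cookies explicitly:
def start_requests(self):
    cookies = {'51job_sessionid': 'xxxxxxxx'}   # hypothetical, copied from a logged-in browser
    yield Request(url=self.start_urls[0], headers=self.headers,
                  cookies=cookies, callback=self.parse)
```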

After installing Scrapy via Anaconda, testing project creation fails as shown below — help would be much appreciated, thank you!!

![Scrapy shows as installed successfully](https://img-ask.csdn.net/upload/201706/28/1498650151_870545.png) Then, when I run scrapy startproject a to create a project, this comes up: ![screenshot](https://img-ask.csdn.net/upload/201706/28/1498650242_926765.png) ![screenshot](https://img-ask.csdn.net/upload/201706/28/1498650234_489297.png)

[Help a newbie] In the Scrapy framework, how do I use a parameter I already have in parse to build a POST request and get back the response data?

I've only been using the Scrapy framework for a week; before this I hand-rolled crawlers without any framework. I've run into a problem. First I request a page:
```
def start_requests(self):
    urls = ["http://www.tiku.cn/index/index/questions?cid=14&cno=1&unitid=800417&chapterid=701354&typeid=600122&thrknowid=700137"]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
```
That goes to the parse method, where I extract the key parameter question_ID. Inside parse I'd like to use this question_id directly to build a POST request, get the JSON payload it returns, and store it in item['correct_answer']. How do I do that? Thanks a lot for taking the time to answer!
```
def parse(self, response):
    item = TikuItem()
    for i in range(1, 11):
        QUESTION_ID = str(response.xpath('(/html/body/div[4]/div[2]/div[2]/div[' + str(i) + ']/div[@class="q-analysis text-l"]/@id)').extract_first()[3:])
        item['question_ID'] = QUESTION_ID
```
This is my items.py file:
```
class TikuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    question_ID = scrapy.Field()     # question number
    correct_answer = scrapy.Field()  # correct answer
```
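A sketch of the usual approach: build the POST inside the loop with FormRequest (or scrapy.Request with method='POST'), pass the partially filled item along in meta, and parse the JSON in the next callback — note that a fresh item is created per question so each POST carries its own. The endpoint URL and the form/JSON field names below are hypothetical; check the browser's network tab for the real ones:

```python
import json
import scrapy


def parse(self, response):
    for i in range(1, 11):
        question_id = str(response.xpath(
            '(/html/body/div[4]/div[2]/div[2]/div[%d]/div[@class="q-analysis text-l"]/@id)' % i
        ).extract_first()[3:])
        item = TikuItem(question_ID=question_id)
        yield scrapy.FormRequest(
            url='http://www.tiku.cn/index/index/analysis',   # hypothetical endpoint
            formdata={'qid': question_id},                    # hypothetical field name
            meta={'item': item},
            callback=self.parse_answer,
        )

def parse_answer(self, response):
    item = response.meta['item']
    data = json.loads(response.text)
    item['correct_answer'] = data.get('answer')               # hypothetical JSON key
    yield item
```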

