python3.7爬虫 使用 selector.xpath('')爬取-线上等

我想用正则取以下的

            LANEIGE (兰芝) Deep Pore Cleansing Foam

            字串即可,请用我用python   selector.xpath('') 的语法要怎么写呢

            '''''''''''''''''''''''''''''''''''''''''''''''
            <div class="col-md-10" >
        <div style="margin:10px 0px; float:left;width:100%;">

            <div style="float:left;margin-left:10px;margin-top:20px;">
                <p> 
                    <span class="fullName">LANEIGE (兰芝) Deep Pore Cleansing Foam</span>

                    <span class="certificate">

                    </span>
                </p>

                <p class="member">

                    <span class="createDttm">
                        更新于:&nbsp;2018-12-02

                        ,&nbsp;资料来源: 其他(非官方)网站

                    </span>
                </p>
            </div>


        </div>
        ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
        www.cosgua.com/cosmetic/1000(原网址)

2个回答

from scrapy.selector import Selector


 html='''
            <div class="col-md-10" >
        <div style="margin:10px 0px; float:left;width:100%;">

            <div style="float:left;margin-left:10px;margin-top:20px;">
                <p> 
                    <span class="fullName">LANEIGE (兰芝) Deep Pore Cleansing Foam</span>

                    <span class="certificate">

                    </span>
                </p>

                <p class="member">

                    <span class="createDttm">
                        更新于:&nbsp;2018-12-02

                        ,&nbsp;资料来源: 其他(非官方)网站

                    </span>
                </p>
            </div>


        </div>
'''
selector = Selector(text=html)
txt=selector.xpath('//div/p/span[@class="fullName"]/text()').extract_first()
print(txt)
Csdn user default icon
上传中...
上传图片
插入图片
抄袭、复制答案,以达到刷声望分或其他目的的行为,在CSDN问答是严格禁止的,一经发现立刻封号。是时候展现真正的技术了!
其他相关推荐
一个百度拇指医生爬虫,想要先实现爬取某个问题的所有链接,但是爬不出来东西。求各位大神帮忙看一下这是为什么?
#写在前面的话 在这个爬虫里我想实现把百度拇指医生里关于“咳嗽”的链接全部爬取下来,下一步要进行的是把爬取到的每个链接里的items里面的内容爬取下来,但是我在第一步就卡住了,求各位大神帮我看一下吧。之前刚刚发了一篇问答,但是不知道怎么回事儿,现在找不到了,(貌似是被删了...?)救救小白吧!感激不尽! 这个是我的爬虫的结构 ![图片说明](https://img-ask.csdn.net/upload/201911/27/1574787999_274479.png) ##ks: ``` # -*- coding: utf-8 -*- import scrapy from kesou.items import KesouItem from scrapy.selector import Selector from scrapy.spiders import Spider from scrapy.http import Request ,FormRequest import pymongo class KsSpider(scrapy.Spider): name = 'ks' allowed_domains = ['kesou,baidu.com'] start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M'] def parse(self, response): item = KesouItem() contents = response.xpath('.//h3[@class="t"]') for content in contents: url = content.xpath('.//a/@href').extract()[0] item['url'] = url yield item if self.offset < 760: self.offset += 10 yield scrapy.Request(url = "https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=" + str(self.offset) + "&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M",callback=self.parse,dont_filter=True) ``` ##items: ``` # -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # https://docs.scrapy.org/en/latest/topics/items.html import scrapy class KesouItem(scrapy.Item): # 问题ID question_ID = scrapy.Field() # 问题描述 question = scrapy.Field() # 医生回答发表时间 answer_pubtime = scrapy.Field() # 问题详情 description = scrapy.Field() # 医生姓名 doctor_name = scrapy.Field() # 医生职位 doctor_title = scrapy.Field() # 医生所在医院 hospital = scrapy.Field() ``` ##middlewares: ``` # -*- coding: utf-8 -*- # Define here the models for your spider middleware # # See documentation in: # https://docs.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals class KesouSpiderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the spider middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_spider_input(self, response, spider): # Called for each response that goes through the spider # middleware and into the spider. # Should return None or raise an exception. return None def process_spider_output(self, response, result, spider): # Called with the results returned from the Spider, after # it has processed the response. # Must return an iterable of Request, dict or Item objects. for i in result: yield i def process_spider_exception(self, response, exception, spider): # Called when a spider or process_spider_input() method # (from other spider middleware) raises an exception. # Should return either None or an iterable of Request, dict # or Item objects. pass def process_start_requests(self, start_requests, spider): # Called with the start requests of the spider, and works # similarly to the process_spider_output() method, except # that it doesn’t have a response associated. # Must return only requests (not items). for r in start_requests: yield r def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) class KesouDownloaderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_request(self, request, spider): # Called for each request that goes through the downloader # middleware. # Must either: # - return None: continue processing this request # - or return a Response object # - or return a Request object # - or raise IgnoreRequest: process_exception() methods of # installed downloader middleware will be called return None def process_response(self, request, response, spider): # Called with the response returned from the downloader. # Must either; # - return a Response object # - return a Request object # - or raise IgnoreRequest return response def process_exception(self, request, exception, spider): # Called when a download handler or a process_request() # (from other downloader middleware) raises an exception. # Must either: # - return None: continue processing this exception # - return a Response object: stops process_exception() chain # - return a Request object: stops process_exception() chain pass def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) ``` ##piplines: ``` # -*- coding: utf-8 -*- # Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html import pymongo from scrapy.utils.project import get_project_settings settings = get_project_settings() class KesouPipeline(object): def __init__(self): host = settings["MONGODB_HOST"] port = settings["MONGODB_PORT"] dbname = settings["MONGODB_DBNAME"] sheetname= settings["MONGODB_SHEETNAME"] # 创建MONGODB数据库链接 client = pymongo.MongoClient(host = host, port = port) # 指定数据库 mydb = client[dbname] # 存放数据的数据库表名 self.sheet = mydb[sheetname] def process_item(self, item, spider): data = dict(item) self.sheet.insert(data) return item ``` ##settings: ``` # -*- coding: utf-8 -*- # Scrapy settings for kesou project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # https://docs.scrapy.org/en/latest/topics/settings.html # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html # https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'kesou' SPIDER_MODULES = ['kesou.spiders'] NEWSPIDER_MODULE = 'kesou.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'kesou (+http://www.yourdomain.com)' # Obey robots.txt rules ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16) #CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0) # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs #DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of: #CONCURRENT_REQUESTS_PER_DOMAIN = 16 #CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default) COOKIES_ENABLED = False # Disable Telnet Console (enabled by default) #TELNETCONSOLE_ENABLED = False USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0" # Override the default request headers: #DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', #} # Enable or disable spider middlewares # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html #SPIDER_MIDDLEWARES = { # 'kesou.middlewares.KesouSpiderMiddleware': 543, #} # Enable or disable downloader middlewares # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html #DOWNLOADER_MIDDLEWARES = { # 'kesou.middlewares.KesouDownloaderMiddleware': 543, #} # Enable or disable extensions # See https://docs.scrapy.org/en/latest/topics/extensions.html #EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, #} # Configure item pipelines # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html ITEM_PIPELINES = { 'kesou.pipelines.KesouPipeline': 300, } # MONGODB 主机名 MONGODB_HOST = "127.0.0.1" # MONGODB 端口号 MONGODB_PORT = 27017 # 数据库名称 MONGODB_DBNAME = "ks" # 存放数据的表名称 MONGODB_SHEETNAME = "ks_urls" # Enable and configure the AutoThrottle extension (disabled by default) # See https://docs.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHROTTLE_ENABLED = True # The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings #HTTPCACHE_ENABLED = True #HTTPCACHE_EXPIRATION_SECS = 0 #HTTPCACHE_DIR = 'httpcache' #HTTPCACHE_IGNORE_HTTP_CODES = [] #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' ``` ##run.py: ``` # -*- coding: utf-8 -*- from scrapy import cmdline cmdline.execute("scrapy crawl ks".split()) ``` ##这个是运行出来的结果: ``` PS D:\scrapy_project\kesou> scrapy crawl ks 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: kesou) 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twis.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryphy 2.6.1, Platform Windows-10-10.0.18362-SP0 2019-11-27 00:14:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'kesou', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['kesou.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67 2019-11-27 00:14:17 [scrapy.extensions.telnet] INFO: Telnet Password: 051629c46f34abdf 2019-11-27 00:14:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled item pipelines: ['kesou.pipelines.KesouPipeline'] 2019-11-27 00:14:19 [scrapy.core.engine] INFO: Spider opened 2019-11-27 00:14:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-11-27 00:14:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-11-27 00:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX% (referer: None) 2019-11-27 00:14:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFFSYX%2B1M> (referer: None) Traceback (most recent call last): File "d:\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output for x in result: File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr> return (_set_referer(r) for r in result or ()) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> return (r for r in result or () if _filter(r)) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> return (r for r in result or () if _filter(r)) File "D:\scrapy_project\kesou\kesou\spiders\ks.py", line 19, in parse item['url'] = url File "d:\anaconda3\lib\site-packages\scrapy\item.py", line 73, in __setitem__ (self.__class__.__name__, key)) KeyError: 'KesouItem does not support field: url' 2019-11-27 00:14:20 [scrapy.core.engine] INFO: Closing spider (finished) 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 10, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/KeyError': 1, 'start_time': datetime.datetime(2019, 11, 26, 16, 14, 19, 863597)} 2019-11-27 00:14:21 [scrapy.core.engine] INFO: Spider closed (finished) ```
python scrapy爬虫 抓取的内容只有一条,怎么破??
目标URL:http://218.92.23.142/sjsz/szxx/Index.aspx(工作需要) 主要目的是爬取网站中的信件类型、信件主题、写信时间、回复时间、回复状态以及其中链接里面的具体内容,然后保存到excel表格中。里面的链接全部都是POST方法,没有出现一个具体的链接,所以我感觉非常恼火。 目前碰到的问题: 1、 但是我只能抓到第一条的信息,后面就抓不到了。具体是这条:市长您好: 我是一名事... 2、 scrapy运行后出现的信息是: 15:01:33 [scrapy] INFO: Scrapy 1.0.3 started (bot: spider2) 2016-01-13 15:01:33 [scrapy] INFO: Optional features available: ssl, http11 2016-01-13 15:01:33 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'spider2.spiders', 'FEED_URI': u'file:///F:/\u5feb\u76d8/workspace/Pythontest/src/Scrapy/spider2/szxx.csv', 'SPIDER_MODULES': ['spider2.spiders'], 'BOT_NAME': 'spider2', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5', 'FEED_FORMAT': 'CSV'} 2016-01-13 15:01:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState 2016-01-13 15:01:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 2016-01-13 15:01:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 2016-01-13 15:01:38 [scrapy] INFO: Enabled item pipelines: 2016-01-13 15:01:38 [scrapy] INFO: Spider opened 2016-01-13 15:01:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2016-01-13 15:01:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) <GET http://218.92.23.142/sjsz/szxx/Index.aspx> (referer: None) 2016-01-13 15:01:39 [scrapy] DEBUG: Filtered duplicate request: <GET http://218.92.23.142/sjsz/szxx/Index.aspx> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) <GET http://218.92.23.142/sjsz/szxx/Index.aspx> (referer: http://218.92.23.142/sjsz/szxx/Index.aspx) 2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) <POST http://218.92.23.142/sjsz/szxx/Index.aspx> (referer: http://218.92.23.142/sjsz/szxx/Index.aspx) 2016-01-13 15:01:39 [scrapy] DEBUG: Redirecting (302) to <GET http://218.92.23.142/sjsz/szxx/GkResult.aspx?infoid=3160105094757> from <POST http://218.92.23.142/sjsz/szxx/Index.aspx> 2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) <GET http://218.92.23.142/sjsz/szxx/GkResult.aspx?infoid=3160105094757> (referer: http://218.92.23.142/sjsz/szxx/Index.aspx) 2016-01-13 15:01:39 [scrapy] DEBUG: Scraped from <200 http://218.92.23.142/sjsz/szxx/GkResult.aspx?infoid=3160105094757> 第一条的信息(太多了,就省略了。。。。) 2016-01-13 15:01:39 [scrapy] DEBUG: Crawled (200) <POST http://218.92.23.142/sjsz/szxx/Index.aspx> (referer: http://218.92.23.142/sjsz/szxx/Index.aspx) ………… 后面的差不多,就不写出来了 2016-01-13 15:01:41 [scrapy] INFO: Stored csv feed (1 items) in: file:///F:/快盘/workspace/Pythontest/src/Scrapy/spider2/szxx.csv 2016-01-13 15:01:41 [scrapy] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 56383, 'downloader/request_count': 17, 'downloader/request_method_count/GET': 3, 'downloader/request_method_count/POST': 14, 'downloader/response_bytes': 118855, 'downloader/response_count': 17, 'downloader/response_status_count/200': 16, 'downloader/response_status_count/302': 1, 'dupefilter/filtered': 120, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2016, 1, 13, 7, 1, 41, 716000), 'item_scraped_count': 1, 'log_count/DEBUG': 20, 'log_count/INFO': 8, 'request_depth_max': 14, 'response_received_count': 16, 'scheduler/dequeued': 17, 'scheduler/dequeued/memory': 17, 'scheduler/enqueued': 17, 'scheduler/enqueued/memory': 17, 'start_time': datetime.datetime(2016, 1, 13, 7, 1, 38, 670000)} 2016-01-13 15:01:41 [scrapy] INFO: Spider closed (finished) 具体的代码如下(代码写的不好,误喷): import sys, copy reload(sys) sys.setdefaultencoding('utf-8') sys.path.append("../") from scrapy.spiders import CrawlSpider from scrapy.http import FormRequest, Request from scrapy.selector import Selector from items import Spider2Item class Domeszxx(CrawlSpider): name = "szxx" allowed_domain = ["218.92.23.142"] start_urls = ["http://218.92.23.142/sjsz/szxx/Index.aspx"] item = Spider2Item() def parse(self, response): selector = Selector(response) # 获得下一页的POST参数 viewstate = ''.join(selector.xpath('//input[@id="__VIEWSTATE"]/@value').extract()[0]) eventvalidation = ''.join(selector.xpath('//input[@id="__EVENTVALIDATION"]/@value').extract()[0]) nextpage = ''.join( selector.xpath('//input[@name="ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage"]/@value').extract()) nextpage_data = { '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1$ctl12$cmdNext', '__EVENTARGUMENT': '', '__VIEWSTATE': viewstate, '__VIEWSTATEGENERATOR': '9DEFE542', '__EVENTVALIDATION': eventvalidation, 'ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage': nextpage } # 获得抓取当前内容的xpath xjlx = ".//*[@id='ContentPlaceHolder1_GridView1_Label2_" xjzt = ".//*[@id='ContentPlaceHolder1_GridView1_LinkButton5_" xxsj = ".//*[@id='ContentPlaceHolder1_GridView1_Label4_" hfsj = ".//*[@id='ContentPlaceHolder1_GridView1_Label5_" nextlink = '//*[@id="ContentPlaceHolder1_GridView1_cmdNext"]/@href' # 获取当前页面公开答复的行数 listnum = len(selector.xpath('//tr')) - 2 # 获得抓取内容 for i in range(0, listnum): item_all = {} xjlx_xpath = xjlx + str(i) + "']/text()" xjzt_xpath = xjzt + str(i) + "']/text()" xxsj_xpath = xxsj + str(i) + "']/text()" hfsj_xpath = hfsj + str(i) + "']/text()" # 信件类型 item_all['xjlx'] = selector.xpath(xjlx_xpath).extract()[0].decode('utf-8').encode('gbk') # 信件主题 item_all['xjzt'] = str(selector.xpath(xjzt_xpath).extract()[0].decode('utf-8').encode('gbk')).replace('\n', '') # 写信时间 item_all['xxsj'] = selector.xpath(xxsj_xpath).extract()[0].decode('utf-8').encode('gbk') # 回复时间 item_all['hfsj'] = selector.xpath(hfsj_xpath).extract()[0].decode('utf-8').encode('gbk') # 获取二级页面中的POST参数 eventtaget = 'ctl00$ContentPlaceHolder1$GridView1$ctl0' + str(i + 2) + '$LinkButton5' content_data = { '__EVENTTARGET': eventtaget, '__EVENTARGUMENT': '', '__VIEWSTATE': viewstate, '__VIEWSTATEGENERATOR': '9DEFE542', '__EVENTVALIDATION': eventvalidation, 'ctl00$ContentPlaceHolder1$GridView1$ctl12$txtGoPage': nextpage } # 完成抓取信息的传递 yield Request(url="http://218.92.23.142/sjsz/szxx/Index.aspx", callback=self.send_value, meta={'item_all': item_all, 'content_data': content_data}) # 进入页面中的二级页面的链接,必须利用POST方法才能提交,无法看到直接的URL,同时将本页中抓取的item和进入下一页的POST方法进行传递 # yield Request(url="http://218.92.23.142/sjsz/szxx/Index.aspx", callback=self.getcontent, # meta={'item': item_all}) # yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=content_data, # callback=self.getcontent) # 进入下一页 if selector.xpath(nextlink).extract(): yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=nextpage_data, callback=self.parse) # 将当前页面的值传递到本函数并存入类的item中 def send_value(self, response): itemx = response.meta['item_all'] post_data = response.meta['content_data'] Domeszxx.item = copy.deepcopy(itemx) yield FormRequest(url="http://218.92.23.142/sjsz/szxx/Index.aspx", formdata=post_data, callback=self.getcontent) return # 将二级链接中值抓取并存入类的item中 def getcontent(self, response): item_getcontent = { 'xfr': ''.join(response.xpath('//*[@id="lblXFName"]/text()').extract()).decode('utf-8').encode('gbk'), 'lxnr': ''.join(response.xpath('//*[@id="lblXFQuestion"]/text()').extract()).decode('utf-8').encode( 'gbk'), 'hfnr': ''.join(response.xpath('//*[@id="lblXFanswer"]/text()').extract()).decode('utf-8').encode( 'gbk')} Domeszxx.item.update(item_getcontent) yield Domeszxx.item return
scrapy爬虫不能自动爬取所有页面
学习scrapy第三天,在爬取[wooyun白帽子精华榜](http://wooyun.org/whitehats/do/1/page/1 "")的时候,不能爬取所有的页面。 items.py ``` # -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # http://doc.scrapy.org/en/latest/topics/items.html import scrapy class WooyunrankautoItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() ''' 以下信息分别为 注册日期 woyun昵称 精华漏洞数 精华比例 wooyun个人主页 ''' register_date = scrapy.Field() nick_name = scrapy.Field() rank_level = scrapy.Field() essence_count = scrapy.Field() essence_ratio = scrapy.Field() ``` pipelines.py ``` # -*- coding: utf-8 -*- # Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html import sys import csv class WooyunrankautoPipeline(object): ''' process the item returned from the spider ''' def __init__(self): reload(sys) if sys.getdefaultencoding()!="utf-8": sys.setdefaultencoding("utf-8") file_obj = open("wooyunrank.csv","wb") fieldnames = ["register_date","nick_name","rank_level","essence_count","essence_ratio"] self.dict_writer = csv.DictWriter(file_obj,fieldnames=fieldnames) self.dict_writer.writeheader() def process_item(self,item,spider): self.dict_writer.writerow(item) return item ``` spider.py ```python #!/usr/bin/python # -*- coding:utf-8 -*- import sys from scrapy.spider import Spider from scrapy.selector import Selector from wooyunrankauto.items import WooyunrankautoItem from scrapy.contrib.spiders import CrawlSpider,Rule from scrapy.contrib.linkextractors import LinkExtractor class WooyunSpider(CrawlSpider): ''' 爬取wooyun漏洞精华榜单 ''' name = "wooyunrankauto" # 爬取速度为1s download_delay = 2 allowed_domains = ["wooyun.org"] start_urls = [ "http://wooyun.org/whitehats/do/1/page/1" ] rules=[ Rule(LinkExtractor(allow=("/whitehats/do/1/page/\d+")),follow=True,callback='parse_item') ] # def __init__(self): # reload(sys) # if sys.getdefaultencoding()!="utf-8": # sys.setdefaultencoding("utf-8") def parse_item(self,response): sel = Selector(response) infos = sel.xpath("/html/body/div[5]/table/tbody/tr") items = [] for info in infos: item = WooyunrankautoItem() item["register_date"] = info.xpath("th[1]/text()").extract()[0] item["rank_level"] = info.xpath("th[2]/text()").extract()[0] item["essence_count"] = info.xpath("th[3]/text()").extract()[0] item["essence_ratio"] = info.xpath("th[4]/text()").extract()[0] item["nick_name"] = info.xpath("td/a/text()").extract()[0] items.append(item) return items ``` 上面的spider.py只能爬取1,2,3,4,5页(日志中显示爬取六次,第一页被重复爬取了) 但是浏览第5页的时候,6,7,8,9页也会出现啊,这里为什么没有爬取到6,7,8,9 第二个版本的spider.py ``` def parse_item(self,response): sel = Selector(response) infos = sel.xpath("/html/body/div[5]/table/tbody/tr") items = [] for info in infos: item = WooyunrankautoItem() item["register_date"] = info.xpath("th[1]/text()").extract()[0] item["rank_level"] = info.xpath("th[2]/text()").extract()[0] item["essence_count"] = info.xpath("th[3]/text()").extract()[0] item["essence_ratio"] = info.xpath("th[4]/text()").extract()[0] item["nick_name"] = info.xpath("td/a/text()").extract()[0] items.append(item) return item ``` 这个版本可以爬取所有页面,但是每个页面有20条信息,我只能取到第一条信息(循环第一条的时候就返回了,这里可以理解)但是为什么这里就可以爬取所有页面 可能是我对scrapy理解还不深入,这里实在不知道什么问题了,我想自动爬取所有页面(而且不会重复爬取),每个页面有20条信息,应该就是20个item。
Scrapy爬取下来的数据不全,为什么总会有遗漏?
本人小白一枚,刚接触Scrapy框架没多久,写了一个简单的Spider,但是发现每一次爬取后的结果都比网页上的真实数据量要少,比如网站上一共有100条,但我爬下来的结果一般会少几条至几十条不等,很少有100条齐的时候。 整个爬虫有两部分,一部分是页面的横向爬取(进入下一页),另一个是纵向的爬取(进入页面中每一产品的详细页面)。之前我一直以为是pipelines存储到excel的时候数据丢失了,后来经过Debug调试,发现是在Spider中,数据就遗漏了,def parse函数中的item数量是齐的,包括yield Request加入到队列中,但是调用def parse_item函数时,就有些产品的详细页面无法进入。这是什么原因呢,是因为Scrapy异步加载受网速之类的影响么,本身就有缺陷,还是说是我设计上面的问题?有什么解决的方法么,不然数据量一大那丢失的不是就很严重么。 求帮助,谢谢各位了。 ``` class MyFirstSpider(Spider): name = "MyFirstSpider" allowed_doamins = ["e-shenhua.com"] start_urls = ["https://www.e-shenhua.com/ec/auction/oilAuctionList.jsp?_DARGS=/ec/auction/oilAuctionList.jsp"] url = 'https://www.e-shenhua.com/ec/auction/oilAuctionList.jsp' def parse(self, response): items = [] selector = Selector(response) contents = selector.xpath('//table[@class="table expandable table-striped"]/tbody/tr') urldomain = 'https://www.e-shenhua.com' for content in contents: item = CyfirstItem() productId = content.xpath('td/a/text()').extract()[0].strip() productUrl = content.xpath('td/a/@href').extract()[0] totalUrl = urldomain + productUrl productName = content.xpath('td/a/text()').extract()[1].strip() deliveryArea = content.xpath('td/text()').extract()[-5].strip() saleUnit = content.xpath('td/text()').extract()[-4] item['productId'] = productId item['totalUrl'] = totalUrl item['productName'] = productName item['deliveryArea'] = deliveryArea item['saleUnit'] = saleUnit items.append(item) print(len(items)) # **************进入每个产品的子网页 for item in items: yield Request(item['totalUrl'],meta={'item':item},callback=self.parse_item) # print(item['productId']) # 下一页的跳转 nowpage = selector.xpath('//div[@class="pagination pagination-small"]/ul/li[@class="active"]/a/text()').extract()[0] nextpage = int(nowpage) + 1 str_nextpage = str(nextpage) nextLink = selector.xpath('//div[@class="pagination pagination-small"]/ul/li[last()]/a/@onclick').extract() if (len(nextLink)): yield scrapy.FormRequest.from_response(response, formdata={ *************** }, callback = self.parse ) # 产品子网页内容的抓取 def parse_item(self,response): sel = Selector(response) item = response.meta['item'] # print(item['productId']) productInfo = sel.xpath('//div[@id="content-products-info"]/table/tbody/tr') titalBidQty = ''.join(productInfo.xpath('td[3]/text()').extract()).strip() titalBidUnit = ''.join(productInfo.xpath('td[3]/span/text()').extract()) titalBid = titalBidQty + " " +titalBidUnit minBuyQty = ''.join(productInfo.xpath('td[4]/text()').extract()).strip() minBuyUnit = ''.join(productInfo.xpath('td[4]/span/text()').extract()) minBuy = minBuyQty + " " + minBuyUnit isminVarUnit = ''.join(sel.xpath('//div[@id="content-products-info"]/table/thead/tr/th[5]/text()').extract()) if(isminVarUnit == '最小变量单位'): minVarUnitsl = ''.join(productInfo.xpath('td[5]/text()').extract()).strip() minVarUnitdw = ''.join(productInfo.xpath('td[5]/span/text()').extract()) minVarUnit = minVarUnitsl + " " + minVarUnitdw startPrice = ''.join(productInfo.xpath('td[6]/text()').extract()).strip().rstrip('/') minAddUnit = ''.join(productInfo.xpath('td[7]/text()').extract()).strip() else: minVarUnit = '' startPrice = ''.join(productInfo.xpath('td[5]/text()').extract()).strip().rstrip('/') minAddUnit = ''.join(productInfo.xpath('td[6]/text()').extract()).strip() item['titalBid'] = titalBid item['minBuyQty'] = minBuy item['minVarUnit'] = minVarUnit item['startPrice'] = startPrice item['minAddUnit'] = minAddUnit # print(item) return item ```
为什么我直接用requests爬网页可以,但用scrapy不行?
``` class job51(): def __init__(self): self.headers={ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'max-age=0', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'Cookie':'' } def start(self): html=session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php",headers=self.headers) self.parse(html) def parse(self,response): tree=lxml.etree.HTML(response.text) resume_url=tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href') print (resume_url[0] ``` 能爬到我想要的结果,就是简历的url,但是用scrapy,同样的headers,页面好像停留在登录页面? ``` class job51(Spider): name = "job51" #allowed_domains = ["my.51job.com"] start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"] headers={ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Cache-Control': 'max-age=0', 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'Cookie':'' } def start_requests(self): yield Request(url=self.start_urls[0],headers=self.headers,callback=self.parse) def parse(self,response): #tree=lxml.etree.HTML(text) selector=Selector(response) print ("<<<<<<<<<<<<<<<<<<<<<",response.text) resume_url=selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href') print (">>>>>>>>>>>>",resume_url) ``` 输出的结果: scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'} 2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened 2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None) 2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None) <<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script> >>>>>>>>>>>> [] 2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None) Traceback (most recent call last): File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output for x in result: File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr> return (_set_referer(r) for r in result or ()) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> return (r for r in result or () if _filter(r)) File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> return (r for r in result or () if _filter(r)) File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse yield Request(resume_url[0],headers=self.headers,callback=self.getResume) File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__ o = super(SelectorList, self).__getitem__(pos) IndexError: list index out of range 2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished) 2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 628, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 5743, 'downloader/response_count': 2, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634), 'log_count/DEBUG': 3, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/IndexError': 1, 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)} 2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
非常简单的scrapy代码但就是不清楚到底哪里出问题了,高手帮忙看看吧!
News_spider文件 # -*- coding: utf-8 -*- import scrapy import re from scrapy import Selector from News.items import NewsItem class NewsSpiderSpider(scrapy.Spider): name = "news_spider" allowed_domains = ["http://18.92.0.1"] start_urls = ['http://18.92.0.1/contents/7/121174.html'] def parse_detail(self, response): sel = Selector(response) items = [] item = NewsItem() item['title'] = sel.css('.div_bt::text').extract()[0] characters = sel.css('.div_zz::text').extract()[0].replace("\xa0","") pattern = re.compile('[:].*[ ]') result = pattern.search(characters) item['post'] = result.group().replace(":","").strip() pattern = re.compile('[ ][^发]*') result = pattern.search(characters) item['approver'] = result.group() pattern = re.compile('[201].{9}') result = pattern.search(characters) item['date_of_publication'] = result.group() pattern = re.compile('([0-9]+)$') result = pattern.search(characters) item['browse_times'] = result.group() content = sel.css('.xwnr').extract()[0] pattern = re.compile('[\u4e00-\u9fa5]|[,、。“”]') result = pattern.findall(content) item['content'] = ''.join(result).replace("仿宋"," ").replace("宋体"," ").replace("楷体"," ") item['img1_url'] = sel.xpath('//*[@id="newpic"]/div[1]/div[1]/img/@src').extract()[0] item['img1_name'] = sel.xpath('//*[@id="newpic"]/div[1]/div[2]/text()').extract()[0] item['img2_url'] = sel.xpath('//*[@id="newpic"]/div[2]/div[1]/img/@src').extract()[0] item['img2_name'] = sel.xpath('//*[@id="newpic"]/div[2]/div[2]').extract()[0] item['img3_url'] = sel.xpath('//*[@id="newpic"]/div[3]/div[1]/img/@src').extract()[0] item['img3_name'] = sel.xpath('//*[@id="newpic"]/div[3]/div[2]/text()').extract()[0] item['img4_url'] = sel.xpath('//*[@id="newpic"]/div[4]/div[1]/img/@src').extract()[0] item['img4_name'] = sel.xpath('//*[@id="newpic"]/div[4]/div[2]/text()').extract()[0] item['img5_url'] = sel.xpath('//*[@id="newpic"]/div[5]/div[1]/img/@src').extract()[0] item['img5_name'] = sel.xpath('//*[@id="newpic"]/div[5]/div[2]/text()').extract()[0] item['img6_url'] = sel.xpath('//*[@id="newpic"]/div[6]/div[1]/img/@src').extract()[0] item['img6_name'] = sel.xpath('//*[@id="newpic"]/div[6]/div[2]/text()').extract()[0] characters = sel.xpath('/html/body/div/div[2]/div[4]/div[4]/text()').extract()[0].replace("\xa0","") pattern = re.compile('[:].*?[ ]') result = pattern.search(characters) item['company'] = result.group().replace(":", "").strip() pattern = re.compile('[ ][^联]*') result = pattern.search(characters) item['writer_photography'] = result.group() pattern = re.compile('(([0-9]|[-])+)$') result = pattern.search(characters) item['tel'] = result.group() items.append(item) items文件 return items import scrapy class NewsItem(scrapy.Item): title = scrapy.Field() post = scrapy.Field() approver = scrapy.Field() date_of_publication = scrapy.Field() browse_times = scrapy.Field() content = scrapy.Field() img1_url = scrapy.Field() img1_name = scrapy.Field() img2_url = scrapy.Field() img2_name = scrapy.Field() img3_url = scrapy.Field() img3_name = scrapy.Field() img4_url = scrapy.Field() img4_name = scrapy.Field() img5_url = scrapy.Field() img5_name = scrapy.Field() img6_url = scrapy.Field() img6_name = scrapy.Field() company = scrapy.Field() writer_photography = scrapy.Field() tel = scrapy.Field() pipelines文件 import MySQLdb import MySQLdb.cursors class NewsPipeline(object): def process_item(self, item, spider): return item class MysqlPipeline(object): def __init__(self): self.conn = MySQLdb.connect('192.168.254.129','root','root','news',charset="utf8",use_unicode=True) self.cursor = self.conn.cursor() def process_item(self, item, spider): insert_sql = "insert into news_table(title,post,approver,date_of_publication,browse_times,content,img1_url,img1_name,img2_url,img2_name,img3_url,img3_name,img4_url,img4_name,img5_url,img5_name,img6_url,img6_name,company,writer_photography,tel)VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)" self.cursor.execute(insert_sql,(item['title'],item['post'],item['approver'],item['date_of_publication'],item['browse_times'],item['content'],item['img1_url'],item['img1_name'],item['img1_url'],item['img1_name'],item['img2_url'],item['img2_name'],item['img3_url'],item['img3_name'],item['img4_url'],item['img4_name'],item['img5_url'],item['img5_name'],item['img6_url'],item['img6_name'],item['company'],item['writer_photography'],item['tel'])) self.conn.commit() setting文件 BOT_NAME = 'News' SPIDER_MODULES = ['News.spiders'] NEWSPIDER_MODULE = 'News.spiders' ROBOTSTXT_OBEY = False COOKIES_ENABLED = True ITEM_PIPELINES = { #'News.pipelines.NewsPipeline': 300, 'News.pipelines.MysqlPipeline': 300, } /usr/bin/python3.5 /home/pzs/PycharmProjects/News/main.py 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: News) 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'News', 'SPIDER_MODULES': ['News.spiders'], 'NEWSPIDER_MODULE': 'News.spiders'} 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled item pipelines: ['News.pipelines.MysqlPipeline'] 2017-04-08 11:00:12 [scrapy.core.engine] INFO: Spider opened 2017-04-08 11:00:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-04-08 11:00:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-04-08 11:00:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://18.92.0.1/contents/7/121174.html> (referer: None) 2017-04-08 11:00:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://18.92.0.1/contents/7/121174.html> (referer: None) Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/usr/local/lib/python3.5/dist-packages/scrapy/spiders/__init__.py", line 76, in parse raise NotImplementedError NotImplementedError 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Closing spider (finished) 2017-04-08 11:00:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 229, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 16609, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 4, 8, 18, 0, 13, 938637), 'log_count/DEBUG': 2, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/NotImplementedError': 1, 'start_time': datetime.datetime(2017, 4, 8, 18, 0, 12, 917719)} 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Spider closed (finished) Process finished with exit code 0 直接运行会弹出NotImplementedError错误,单步调试也看不出到底哪里出了问题
动态规划入门到熟悉,看不懂来打我啊
持续更新。。。。。。 2.1斐波那契系列问题 2.2矩阵系列问题 2.3跳跃系列问题 3.1 01背包 3.2 完全背包 3.3多重背包 3.4 一些变形选讲 2.1斐波那契系列问题 在数学上,斐波纳契数列以如下被以递归的方法定义:F(0)=0,F(1)=1, F(n)=F(n-1)+F(n-2)(n&gt;=2,n∈N*)根据定义,前十项为1, 1, 2, 3...
130 个相见恨晚的超实用网站,一次性分享出来
相见恨晚的超实用网站 持续更新中。。。
Java学习的正确打开方式
在博主认为,对于入门级学习java的最佳学习方法莫过于视频+博客+书籍+总结,前三者博主将淋漓尽致地挥毫于这篇博客文章中,至于总结在于个人,实际上越到后面你会发现学习的最好方式就是阅读参考官方文档其次就是国内的书籍,博客次之,这又是一个层次了,这里暂时不提后面再谈。博主将为各位入门java保驾护航,各位只管冲鸭!!!上天是公平的,只要不辜负时间,时间自然不会辜负你。 何谓学习?博主所理解的学习,它是一个过程,是一个不断累积、不断沉淀、不断总结、善于传达自己的个人见解以及乐于分享的过程。
程序员必须掌握的核心算法有哪些?
由于我之前一直强调数据结构以及算法学习的重要性,所以就有一些读者经常问我,数据结构与算法应该要学习到哪个程度呢?,说实话,这个问题我不知道要怎么回答你,主要取决于你想学习到哪些程度,不过针对这个问题,我稍微总结一下我学过的算法知识点,以及我觉得值得学习的算法。这些算法与数据结构的学习大多数是零散的,并没有一本把他们全部覆盖的书籍。下面是我觉得值得学习的一些算法以及数据结构,当然,我也会整理一些看过...
Python——画一棵漂亮的樱花树(不同种樱花+玫瑰+圣诞树喔)
最近翻到一篇知乎,上面有不少用Python(大多是turtle库)绘制的树图,感觉很漂亮,我整理了一下,挑了一些我觉得不错的代码分享给大家(这些我都测试过,确实可以生成) one 樱花树 动态生成樱花 效果图(这个是动态的): 实现代码 import turtle as T import random import time # 画樱花的躯干(60,t) def Tree(branch, ...
大学四年自学走来,这些私藏的实用工具/学习网站我贡献出来了
大学四年,看课本是不可能一直看课本的了,对于学习,特别是自学,善于搜索网上的一些资源来辅助,还是非常有必要的,下面我就把这几年私藏的各种资源,网站贡献出来给你们。主要有:电子书搜索、实用工具、在线视频学习网站、非视频学习网站、软件下载、面试/求职必备网站。 注意:文中提到的所有资源,文末我都给你整理好了,你们只管拿去,如果觉得不错,转发、分享就是最大的支持了。 一、电子书搜索 对于大部分程序员...
为啥国人偏爱Mybatis,而老外喜欢Hibernate/JPA呢?
关于SQL和ORM的争论,永远都不会终止,我也一直在思考这个问题。昨天又跟群里的小伙伴进行了一番讨论,感触还是有一些,于是就有了今天这篇文。 声明:本文不会下关于Mybatis和JPA两个持久层框架哪个更好这样的结论。只是摆事实,讲道理,所以,请各位看官勿喷。 一、事件起因 关于Mybatis和JPA孰优孰劣的问题,争论已经很多年了。一直也没有结论,毕竟每个人的喜好和习惯是大不相同的。我也看...
我在支付宝花了1分钟,查到了女朋友的开房记录!
在大数据时代下,不管你做什么都会留下蛛丝马迹,只要学会把各种软件运用到极致,捉奸简直轻而易举。今天就来给大家分享一下,什么叫大数据抓出轨。据史料证明,马爸爸年轻时曾被...
shell脚本:备份数据库、代码上线
备份MySQL数据库 场景: 一台MySQL服务器,跑着5个数据库,在没有做主从的情况下,需要对这5个库进行备份 需求: 1)每天备份一次,需要备份所有的库 2)把备份数据存放到/data/backup/下 3)备份文件名称格式示例:dbname-2019-11-23.sql 4)需要对1天以前的所有sql文件压缩,格式为gzip 5)本地数据保留1周 6)需要把备份的数据同步到远程备份中心,假如...
聊聊C语言和指针的本质
坐着绿皮车上海到杭州,24块钱,很宽敞,在火车上非正式地聊几句。 很多编程语言都以 “没有指针” 作为自己的优势来宣传,然而,对于C语言,指针却是与生俱来的。 那么,什么是指针,为什么大家都想避开指针。 很简单, 指针就是地址,当一个地址作为一个变量存在时,它就被叫做指针,该变量的类型,自然就是指针类型。 指针的作用就是,给出一个指针,取出该指针指向地址处的值。为了理解本质,我们从计算机模型说起...
为什么你学不过动态规划?告别动态规划,谈谈我的经验
动态规划难吗?说实话,我觉得很难,特别是对于初学者来说,我当时入门动态规划的时候,是看 0-1 背包问题,当时真的是一脸懵逼。后来,我遇到动态规划的题,看的懂答案,但就是自己不会做,不知道怎么下手。就像做递归的题,看的懂答案,但下不了手,关于递归的,我之前也写过一篇套路的文章,如果对递归不大懂的,强烈建议看一看:为什么你学不会递归,告别递归,谈谈我的经验 对于动态规划,春招秋招时好多题都会用到动态...
程序员一般通过什么途径接私活?
二哥,你好,我想知道一般程序猿都如何接私活,我也想接,能告诉我一些方法吗? 上面是一个读者“烦不烦”问我的一个问题。其实不止是“烦不烦”,还有很多读者问过我类似这样的问题。 我接的私活不算多,挣到的钱也没有多少,加起来不到 20W。说实话,这个数目说出来我是有点心虚的,毕竟太少了,大家轻喷。但我想,恰好配得上“一般程序员”这个称号啊。毕竟苍蝇再小也是肉,我也算是有经验的人了。 唾弃接私活、做外...
字节跳动面试官这样问消息队列:分布式事务、重复消费、顺序消费,我整理了一下
你知道的越多,你不知道的越多 点赞再看,养成习惯 GitHub上已经开源 https://github.com/JavaFamily 有一线大厂面试点脑图、个人联系方式和人才交流群,欢迎Star和完善 前言 消息队列在互联网技术存储方面使用如此广泛,几乎所有的后端技术面试官都要在消息队列的使用和原理方面对小伙伴们进行360°的刁难。 作为一个在互联网公司面一次拿一次Offer的面霸...
2020年大前端发展趋势
迅速发展的前端开发,在每⼀年,都为开发者带来了新的关键词。2019 年已步⼊尾声,2020 年前端发展的关键词⼜将有哪些呢?发展的方向又会是什么呢?参考2019年大前端的发展,不出意外,前端依旧会围绕⼩程序、超级APP、跨端开发、前端⼯程化以及新技术运用等几个方面进行展开(可以参考2019年大前端技术趋势深度解读)。 小程序 在⼩程序⽅⾯,今年仍然是⼩程序突⻜猛进的⼀年,各⼤主流的 App 都上线...
如何安装 IntelliJ IDEA 最新版本——详细教程
IntelliJ IDEA 简称 IDEA,被业界公认为最好的 Java 集成开发工具,尤其在智能代码助手、代码自动提示、代码重构、代码版本管理(Git、SVN、Maven)、单元测试、代码分析等方面有着亮眼的发挥。IDEA 产于捷克,开发人员以严谨著称的东欧程序员为主。IDEA 分为社区版和付费版两个版本。 我呢,一直是 Eclipse 的忠实粉丝,差不多十年的老用户了。很早就接触到了 IDEA...
1个月时间整理了2019年上千道Java面试题,近500页文档!
Spring 面试题 1、一般问题 1.1、不同版本的 spring Framework 有哪些主要功能? 1.2、什么是 spring Framework? 1.3、列举 spring Framework 的优点。 1.4、spring Framework 有哪些不同的功能? 1.5、spring Framework 中有多少个模块,它们分别是什么? 1.6、什么是 spring ...
面试还搞不懂redis,快看看这40道面试题(含答案和思维导图)
Redis 面试题 1、什么是 Redis?. 2、Redis 的数据类型? 3、使用 Redis 有哪些好处? 4、Redis 相比 Memcached 有哪些优势? 5、Memcache 与 Redis 的区别都有哪些? 6、Redis 是单进程单线程的? 7、一个字符串类型的值能存储最大容量是多少? 8、Redis 的持久化机制是什么?各自的优缺点? 9、Redis 常见性...
为什么要推荐大家学习字节码?
配套视频: 为什么推荐大家学习Java字节码 https://www.bilibili.com/video/av77600176/ 一、背景 本文主要探讨:为什么要学习 JVM 字节码? 可能很多人会觉得没必要,因为平时开发用不到,而且不学这个也没耽误学习。 但是这里分享一点感悟,即人总是根据自己已经掌握的知识和技能来解决问题的。 这里有个悖论,有时候你觉得有些技术没用恰恰是...
在阿里,40岁的奋斗姿势
在阿里,40岁的奋斗姿势 在阿里,什么样的年纪可以称为老呢?35岁? 在云网络,有这样一群人,他们的平均年龄接近40,却刚刚开辟职业生涯的第二战场。 他们的奋斗姿势是什么样的呢? 洛神赋 “翩若惊鸿,婉若游龙。荣曜秋菊,华茂春松。髣髴兮若轻云之蔽月,飘飖兮若流风之回雪。远而望之,皎若太阳升朝霞;迫而察之,灼若芙蕖出渌波。” 爱洛神,爱阿里云 2018年,阿里云网络产品部门启动洛神2.0升...
【超详细分析】关于三次握手与四次挥手面试官想考我们什么?
在面试中,三次握手和四次挥手可以说是问的最频繁的一个知识点了,我相信大家也都看过很多关于三次握手与四次挥手的文章,今天的这篇文章,重点是围绕着面试,我们应该掌握哪些比较重要的点,哪些是比较被面试官给问到的,我觉得如果你能把我下面列举的一些点都记住、理解,我想就差不多了。 三次握手 当面试官问你为什么需要有三次握手、三次握手的作用、讲讲三次三次握手的时候,我想很多人会这样回答: 首先很多人会先讲下握...
压测学习总结(1)——高并发性能指标:QPS、TPS、RT、吞吐量详解
一、QPS,每秒查询 QPS:Queries Per Second意思是“每秒查询率”,是一台服务器每秒能够相应的查询次数,是对一个特定的查询服务器在规定时间内所处理流量多少的衡量标准。互联网中,作为域名系统服务器的机器的性能经常用每秒查询率来衡量。 二、TPS,每秒事务 TPS:是TransactionsPerSecond的缩写,也就是事务数/秒。它是软件测试结果的测量单位。一个事务是指一...
新程序员七宗罪
当我发表这篇文章《为什么每个工程师都应该开始考虑开发中的分析和编程技能呢?》时,我从未想到它会对读者产生如此积极的影响。那些想要开始探索编程和数据科学领域的人向我寻求建议;还有一些人问我下一篇文章的发布日期;还有许多人询问如何顺利过渡到这个职业。我非常鼓励大家继续分享我在这个旅程的经验,学习,成功和失败,以帮助尽可能多的人过渡到一个充满无数好处和机会的职业生涯。亲爱的读者,谢谢你。 -罗伯特。 ...
活到老,学到老,程序员也该如此
全文共2763字,预计学习时长8分钟 图片来源:Pixabay 此前,“网传阿里巴巴要求尽快实现P8全员35周岁以内”的消息闹得沸沸扬扬。虽然很快被阿里辟谣,但苍蝇不叮无缝的蛋,无蜜不招彩蝶蜂。消息从何而来?真相究竟怎样?我们无从而知。我们只知道一个事实:不知从何时开始,程序猿也被划在了“吃青春饭”行业之列。 饱受“996ICU”摧残后,好不容易“头秃了变强了”,即将步入为“高...
Vue快速实现通用表单验证
本文开篇第一句话,想引用鲁迅先生《祝福》里的一句话,那便是:“我真傻,真的,我单单知道后端整天都是CRUD,我没想到前端整天都是Form表单”。这句话要从哪里说起呢?大概要从最近半个月的“全栈工程师”说起。项目上需要做一个城市配载的功能,顾名思义,就是通过框选和拖拽的方式在地图上完成配载。博主选择了前后端分离的方式,在这个过程中发现:首先,只要有依赖jQuery的组件,譬如Kendoui,即使使用...
2019年Spring Boot面试都问了什么?快看看这22道面试题!
Spring Boot 面试题 1、什么是 Spring Boot? 2、Spring Boot 有哪些优点? 3、什么是 JavaConfig? 4、如何重新加载 Spring Boot 上的更改,而无需重新启动服务器? 5、Spring Boot 中的监视器是什么? 6、如何在 Spring Boot 中禁用 Actuator 端点安全性? 7、如何在自定义端口上运行 Sprin...
Unity项目在pc和ios设备上黑屏的原因探究
0x00 由于项目上线了windows平台的项目(别问我为什么,咱也不敢说,咱也不敢问),由Unity5.4.6升级到Unity2018的过程中,遇到了各种各样的坑,本文为避坑指南1。本项目没有使用HDR和抗锯齿,由于查这几个问题查到吐血,前后用了3天的时间,本文充满了怨气,行文非常啰嗦,需要快速解决问题的,可以直接拉到最后看结论。 0x01 法线贴图 项目在unity2018出了新的androi...
关于裁员几点看法及建议
最近网易裁员事件引起广泛关注,昨天网易针对此事,也发了声明,到底谁对谁错,孰是孰非?我们作为吃瓜观众实在是知之甚少,所以不敢妄下定论。身处软件开发这个行业,近一两年来,对...
面试官:关于Java性能优化,你有什么技巧
通过使用一些辅助性工具来找到程序中的瓶颈,然后就可以对瓶颈部分的代码进行优化。 一般有两种方案:即优化代码或更改设计方法。我们一般会选择后者,因为不去调用以下代码要比调用一些优化的代码更能提高程序的性能。而一个设计良好的程序能够精简代码,从而提高性能。 下面将提供一些在JAVA程序的设计和编码中,为了能够提高JAVA程序的性能,而经常采用的一些方法和技巧。 1.对象的生成和大小的调整。 J...
【图解算法面试】记一次面试:说说游戏中的敏感词过滤是如何实现的?
版权声明:本文为苦逼的码农原创。未经同意禁止任何形式转载,特别是那些复制粘贴到别的平台的,否则,必定追究。欢迎大家多多转发,谢谢。 小秋今天去面试了,面试官问了一个与敏感词过滤算法相关的问题,然而小秋对敏感词过滤算法一点也没听说过。于是,有了下下事情的发生… 面试官开怼 面试官:玩过王者荣耀吧?了解过敏感词过滤吗?,例如在游戏里,如果我们发送“你在干嘛?麻痹演员啊你?”,由于“麻痹”是一个敏感词,...
GitHub 标星 1.6w+,我发现了一个宝藏项目,作为编程新手有福了!
大家好,我是 Rocky0429,一个最近老在 GitHub 上闲逛的蒟蒻… 特别惭愧的是,虽然我很早就知道 GitHub,但是学会逛 GitHub 的时间特别晚。当时一方面是因为菜,看着这种全是英文的东西难受,不知道该怎么去玩,另一方面是一直在搞 ACM,没有做一些工程类的项目,所以想当然的以为和 GitHub 也没什么关系(当然这种想法是错误的)。 后来自己花了一个星期看完了 Pyt...
计算机专业的书普遍都这么贵,你们都是怎么获取资源的?
介绍几个可以下载编程电子书籍的网站。 1.Github Github上编程书资源很多,你可以根据类型和语言去搜索。推荐几个热门的: free-programming-books-zh_CN:58K 星的GitHub,编程语言、WEB、函数、大数据、操作系统、在线课程、数据库相关书籍应有尽有,共有几百本。 Go语言高级编程:涵盖CGO,Go汇编语言,RPC实现,Protobuf插件实现,Web框架实...
毕业5年,我问遍了身边的大佬,总结了他们的学习方法
我问了身边10个大佬,总结了他们的学习方法,原来成功都是有迹可循的。
推荐10个堪称神器的学习网站
每天都会收到很多读者的私信,问我:“二哥,有什么推荐的学习网站吗?最近很浮躁,手头的一些网站都看烦了,想看看二哥这里有什么新鲜货。” 今天一早做了个恶梦,梦到被老板辞退了。虽然说在我们公司,只有我辞退老板的份,没有老板辞退我这一说,但是还是被吓得 4 点多都起来了。(主要是因为我掌握着公司所有的核心源码,哈哈哈) 既然 4 点多起来,就得好好利用起来。于是我就挑选了 10 个堪称神器的学习网站,推...
这些软件太强了,Windows必装!尤其程序员!
Windows可谓是大多数人的生产力工具,集娱乐办公于一体,虽然在程序员这个群体中都说苹果是信仰,但是大部分不都是从Windows过来的,而且现在依然有很多的程序员用Windows。 所以,今天我就把我私藏的Windows必装的软件分享给大家,如果有一个你没有用过甚至没有听过,那你就赚了????,这可都是提升你幸福感的高效率生产力工具哦! 走起!???? NO、1 ScreenToGif 屏幕,摄像头和白板...
MacBook Pro 入手一年了,到底香不香?
最近又有小伙伴问到底值不值得入手一台 MacBook Pro,松哥自己在 2018 年 10 月份的时候入手了一台,到现在为止,也用了一年多了,今天就来和小伙伴们聊一聊使用感受,至于到底值不值,需要大家自行判断。 我的第一台笔记本是大一第二学期(2012 年 4 月份)入手的,是一台 Sony 的 VAIO,这台电脑现在也一直在用,给大家录制的视频教程都是用这台电脑录制了,在接近 8 年的时间里,...
大学四年因为知道了这32个网站,我成了别人眼中的大神!
依稀记得,毕业那天,我们导员发给我毕业证的时候对我说“你可是咱们系的风云人物啊”,哎呀,别提当时多开心啦????,嗯,我们导员是所有导员中最帅的一个,真的???? 不过,导员说的是实话,很多人都叫我大神的,为啥,因为我知道这32个网站啊,你说强不强????,这次是绝对的干货,看好啦,走起来! PS:每个网站都是学计算机混互联网必须知道的,真的牛杯,我就不过多介绍了,大家自行探索,觉得没用的,尽管留言吐槽吧???? 社...
【程序人生】程序员接私活常用平台汇总
00. 目录 文章目录00. 目录01. 前言02. 程序员客栈03. 码市04. 猪八戒网05. 开源众包06. 智城外包网07. 实现网08. 猿急送09. 人人开发10. 开发邦11. 电鸭社区12. 快码13. 英选14. Upwork15. Freelancer16. Dribbble17. Remoteok18. Toptal19. AngelList20. Topcoder21. ...
关于终身成长的几点感想
上周,为了凑单把加入购物车很久的《终身成长:重新定义成功的思维模式》也一起买了。 起初我以为只是单纯的心灵鸡汤类的书,后来看介绍说是“影响美国一代人的心理励志之作,被无数次引用的成功学观点。美国亚马逊心理畅销书在榜10年,《 时代周刊》《早安美国》《华尔街日报》热赞,比尔·盖茨撰文推荐。” 简单来讲,全书主要是讲了两种思维方式:固定型思维模式者和成长型思维模式。有理论,有案例,有方法。书中的关...
Fiddler+夜神模拟器进行APP抓包
Fiddler+夜神模拟器进行APP抓包 作者:霞落满天 需求:对公司APP进行抓包获取详细的接口信息,这是现在开发必备的。 工具:Fiddler抓包,夜神模拟器 模拟手机 安装APP 1.下载Fiddler https://www.telerik.com/download/fiddler Fiddler正是在这里帮助您记录计算机和Internet之间传递的所有HTTP和HTTPS通信...
2020年,冯唐49岁:我给20、30岁IT职场年轻人的建议
点击“技术领导力”关注∆每天早上8:30推送 作者|Mr.K 编辑| Emma 来源|技术领导力(ID:jishulingdaoli) 前天的推文《冯唐:职场人35岁以后,方法论比经验重要》,收到了不少读者的反馈,觉得挺受启发。其实,冯唐写了不少关于职场方面的文章,都挺不错的。可惜大家只记住了“春风十里不如你”、“如何避免成为油腻腻的中年人”等不那么正经的文章。 本文整理了冯...
作为一个程序员,CPU的这些硬核知识你必须会!
CPU对每个程序员来说,是个既熟悉又陌生的东西? 如果你只知道CPU是中央处理器的话,那可能对你并没有什么用,那么作为程序员的我们,必须要搞懂的就是CPU这家伙是如何运行的,尤其要搞懂它里面的寄存器是怎么一回事,因为这将让你从底层明白程序的运行机制。 随我一起,来好好认识下CPU这货吧 把CPU掰开来看 对于CPU来说,我们首先就要搞明白它是怎么回事,也就是它的内部构造,当然,CPU那么牛的一个东...
破14亿,Python分析我国存在哪些人口危机!
一、背景 二、爬取数据 三、数据分析 1、总人口 2、男女人口比例 3、人口城镇化 4、人口增长率 5、人口老化(抚养比) 6、各省人口 7、世界人口 四、遇到的问题 遇到的问题 1、数据分页,需要获取从1949-2018年数据,观察到有近20年参数:LAST20,由此推测获取近70年的参数可设置为:LAST70 2、2019年数据没有放上去,可以手动添加上去 3、将数据进行 行列转换 4、列名...
相关热词 c# id读写器 c#俄罗斯方块源码 c# linq原理 c# 装箱有什么用 c#集合 复制 c# 一个字符串分组 c++和c#哪个就业率高 c# 批量动态创建控件 c# 模块和程序集的区别 c# gmap 截图
立即提问