URLError when loading a dataset with gensim.downloader.load()

I get a URLError when loading a dataset with gensim.downloader.load(). I'm a complete beginner, so any guidance would be appreciated.

Code:

import gensim.downloader as api
model = api.load("glove-twitter-25")

Error output:

E:\Anaconda\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "E:\Anaconda\lib\urllib\request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "E:\Anaconda\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "E:\Anaconda\lib\http\client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "E:\Anaconda\lib\http\client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "E:\Anaconda\lib\http\client.py", line 1016, in _send_output
    self.send(msg)
  File "E:\Anaconda\lib\http\client.py", line 956, in send
    self.connect()
  File "E:\Anaconda\lib\http\client.py", line 1384, in connect
    super().connect()
  File "E:\Anaconda\lib\http\client.py", line 928, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "E:\Anaconda\lib\socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "E:\Anaconda\lib\socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11004] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/lenovo/Desktop/NLP/test.py", line 113, in <module>
    model = api.load("glove-twitter-25")
  File "E:\Anaconda\lib\site-packages\gensim\downloader.py", line 411, in load
    _download(name)
  File "E:\Anaconda\lib\site-packages\gensim\downloader.py", line 287, in _download
    urllib.urlretrieve(url_load_file, init_path)
  File "E:\Anaconda\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "E:\Anaconda\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "E:\Anaconda\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "E:\Anaconda\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "E:\Anaconda\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "E:\Anaconda\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "E:\Anaconda\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "E:\Anaconda\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "E:\Anaconda\lib\urllib\request.py", line 543, in _open
    '_open', req)
  File "E:\Anaconda\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "E:\Anaconda\lib\urllib\request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "E:\Anaconda\lib\urllib\request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11004] getaddrinfo failed>
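
The root cause here is not gensim itself: `socket.gaierror: [Errno 11004] getaddrinfo failed` means the hostname of the download server could not be resolved, i.e. the machine has no working DNS/internet access or needs to go through a proxy. Below is a minimal diagnostic sketch (an addition for illustration, not part of the original post) that first checks whether the download host resolves and then retries the download with proxy environment variables, which urllib (used internally by gensim's downloader) picks up automatically. The proxy address is a placeholder you would replace with your own.

```python
# Diagnostic sketch (assumption-labeled addition, not from the original question):
# "getaddrinfo failed" means the download host could not be resolved, so check
# DNS first and, if needed, point urllib at a proxy before retrying the download.
import os
import socket

import gensim.downloader as api

HOST = "github.com"  # gensim-data models are served from GitHub releases

try:
    socket.getaddrinfo(HOST, 443)
    print("DNS resolution for %s works" % HOST)
except socket.gaierror as err:
    print("cannot resolve %s: %s -- check network/DNS/proxy settings" % (HOST, err))

# If the machine sits behind an HTTP(S) proxy, urllib.request (which
# gensim.downloader uses) honors these environment variables. The address
# below is a placeholder; substitute your real proxy.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"

model = api.load("glove-twitter-25")
print(model.most_similar("computer", topn=3))
```

If the host is simply unreachable from your network, another commonly suggested workaround (also not from this thread) is to fetch the model archive on a machine that does have access, place it under `~/gensim-data/`, or load the extracted vectors directly with `gensim.models.KeyedVectors.load_word2vec_format`.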

其他相关推荐
一个百度拇指医生爬虫,想要先实现爬取某个问题的所有链接,但是爬不出来东西。求各位大神帮忙看一下这是为什么?
#写在前面的话 在这个爬虫里我想实现把百度拇指医生里关于“咳嗽”的链接全部爬取下来,下一步要进行的是把爬取到的每个链接里的items里面的内容爬取下来,但是我在第一步就卡住了,求各位大神帮我看一下吧。之前刚刚发了一篇问答,但是不知道怎么回事儿,现在找不到了,(貌似是被删了...?)救救小白吧!感激不尽! 这个是我的爬虫的结构 ![图片说明](https://img-ask.csdn.net/upload/201911/27/1574787999_274479.png) ##ks: ``` # -*- coding: utf-8 -*- import scrapy from kesou.items import KesouItem from scrapy.selector import Selector from scrapy.spiders import Spider from scrapy.http import Request ,FormRequest import pymongo class KsSpider(scrapy.Spider): name = 'ks' allowed_domains = ['kesou,baidu.com'] start_urls = ['https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M'] def parse(self, response): item = KesouItem() contents = response.xpath('.//h3[@class="t"]') for content in contents: url = content.xpath('.//a/@href').extract()[0] item['url'] = url yield item if self.offset < 760: self.offset += 10 yield scrapy.Request(url = "https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=" + str(self.offset) + "&oq=%E5%92%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX%2B1M",callback=self.parse,dont_filter=True) ``` ##items: ``` # -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # https://docs.scrapy.org/en/latest/topics/items.html import scrapy class KesouItem(scrapy.Item): # 问题ID question_ID = scrapy.Field() # 问题描述 question = scrapy.Field() # 医生回答发表时间 answer_pubtime = scrapy.Field() # 问题详情 description = scrapy.Field() # 医生姓名 doctor_name = scrapy.Field() # 医生职位 doctor_title = scrapy.Field() # 医生所在医院 hospital = scrapy.Field() ``` ##middlewares: ``` # -*- coding: utf-8 -*- # Define here the models for your spider middleware # # See documentation in: # https://docs.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals class KesouSpiderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the spider middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_spider_input(self, response, spider): # Called for each response that goes through the spider # middleware and into the spider. # Should return None or raise an exception. return None def process_spider_output(self, response, result, spider): # Called with the results returned from the Spider, after # it has processed the response. # Must return an iterable of Request, dict or Item objects. for i in result: yield i def process_spider_exception(self, response, exception, spider): # Called when a spider or process_spider_input() method # (from other spider middleware) raises an exception. # Should return either None or an iterable of Request, dict # or Item objects. pass def process_start_requests(self, start_requests, spider): # Called with the start requests of the spider, and works # similarly to the process_spider_output() method, except # that it doesn’t have a response associated. # Must return only requests (not items). for r in start_requests: yield r def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) class KesouDownloaderMiddleware(object): # Not all methods need to be defined. 
If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_request(self, request, spider): # Called for each request that goes through the downloader # middleware. # Must either: # - return None: continue processing this request # - or return a Response object # - or return a Request object # - or raise IgnoreRequest: process_exception() methods of # installed downloader middleware will be called return None def process_response(self, request, response, spider): # Called with the response returned from the downloader. # Must either; # - return a Response object # - return a Request object # - or raise IgnoreRequest return response def process_exception(self, request, exception, spider): # Called when a download handler or a process_request() # (from other downloader middleware) raises an exception. # Must either: # - return None: continue processing this exception # - return a Response object: stops process_exception() chain # - return a Request object: stops process_exception() chain pass def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) ``` ##piplines: ``` # -*- coding: utf-8 -*- # Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html import pymongo from scrapy.utils.project import get_project_settings settings = get_project_settings() class KesouPipeline(object): def __init__(self): host = settings["MONGODB_HOST"] port = settings["MONGODB_PORT"] dbname = settings["MONGODB_DBNAME"] sheetname= settings["MONGODB_SHEETNAME"] # 创建MONGODB数据库链接 client = pymongo.MongoClient(host = host, port = port) # 指定数据库 mydb = client[dbname] # 存放数据的数据库表名 self.sheet = mydb[sheetname] def process_item(self, item, spider): data = dict(item) self.sheet.insert(data) return item ``` ##settings: ``` # -*- coding: utf-8 -*- # Scrapy settings for kesou project # # For simplicity, this file contains only settings considered important or # commonly used. 
You can find more settings consulting the documentation: # # https://docs.scrapy.org/en/latest/topics/settings.html # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html # https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'kesou' SPIDER_MODULES = ['kesou.spiders'] NEWSPIDER_MODULE = 'kesou.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'kesou (+http://www.yourdomain.com)' # Obey robots.txt rules ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16) #CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0) # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs #DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of: #CONCURRENT_REQUESTS_PER_DOMAIN = 16 #CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default) COOKIES_ENABLED = False # Disable Telnet Console (enabled by default) #TELNETCONSOLE_ENABLED = False USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0" # Override the default request headers: #DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', #} # Enable or disable spider middlewares # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html #SPIDER_MIDDLEWARES = { # 'kesou.middlewares.KesouSpiderMiddleware': 543, #} # Enable or disable downloader middlewares # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html #DOWNLOADER_MIDDLEWARES = { # 'kesou.middlewares.KesouDownloaderMiddleware': 543, #} # Enable or disable extensions # See https://docs.scrapy.org/en/latest/topics/extensions.html #EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, #} # Configure item pipelines # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html ITEM_PIPELINES = { 'kesou.pipelines.KesouPipeline': 300, } # MONGODB 主机名 MONGODB_HOST = "127.0.0.1" # MONGODB 端口号 MONGODB_PORT = 27017 # 数据库名称 MONGODB_DBNAME = "ks" # 存放数据的表名称 MONGODB_SHEETNAME = "ks_urls" # Enable and configure the AutoThrottle extension (disabled by default) # See https://docs.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHROTTLE_ENABLED = True # The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings #HTTPCACHE_ENABLED = True #HTTPCACHE_EXPIRATION_SECS = 0 #HTTPCACHE_DIR = 'httpcache' #HTTPCACHE_IGNORE_HTTP_CODES = [] #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' ``` ##run.py: ``` # -*- coding: utf-8 -*- from scrapy import cmdline cmdline.execute("scrapy crawl ks".split()) ``` ##这个是运行出来的结果: ``` PS D:\scrapy_project\kesou> scrapy crawl ks 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: kesou) 2019-11-27 00:14:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twis.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC 
v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryphy 2.6.1, Platform Windows-10-10.0.18362-SP0 2019-11-27 00:14:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'kesou', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['kesou.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67 2019-11-27 00:14:17 [scrapy.extensions.telnet] INFO: Telnet Password: 051629c46f34abdf 2019-11-27 00:14:17 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-11-27 00:14:19 [scrapy.middleware] INFO: Enabled item pipelines: ['kesou.pipelines.KesouPipeline'] 2019-11-27 00:14:19 [scrapy.core.engine] INFO: Spider opened 2019-11-27 00:14:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-11-27 00:14:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-11-27 00:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%E5%92%B3%E5&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFXJvk%2FSYX% (referer: None) 2019-11-27 00:14:20 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.baidu.com/s?wd=%E5%92%B3%E5%97%BD&pn=0&oq=%B3%E5%97%BD&ct=2097152&ie=utf-8&si=muzhi.baidu.com&rsv_pq=980e0c55000e2402&rsv_t=ed3f0i5yeefxTMskgzim00cCUyVujMRnw0Vs4o1%2Bo%2Bohf9rFFSYX%2B1M> (referer: None) Traceback (most recent call last): File "d:\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback yield next(it) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output for x in result: File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr> return (_set_referer(r) for r in result or ()) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> return (r for r in result 
or () if _filter(r)) File "d:\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable for r in iterable: File "d:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> return (r for r in result or () if _filter(r)) File "D:\scrapy_project\kesou\kesou\spiders\ks.py", line 19, in parse item['url'] = url File "d:\anaconda3\lib\site-packages\scrapy\item.py", line 73, in __setitem__ (self.__class__.__name__, key)) KeyError: 'KesouItem does not support field: url' 2019-11-27 00:14:20 [scrapy.core.engine] INFO: Closing spider (finished) 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 2019-11-27 00:14:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 438, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'downloader/response_bytes': 68368, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.992207, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 26, 16, 14, 20, 855804), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 10, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/KeyError': 1, 'start_time': datetime.datetime(2019, 11, 26, 16, 14, 19, 863597)} 2019-11-27 00:14:21 [scrapy.core.engine] INFO: Spider closed (finished) ```
用okhttp实现断点续传,网络请求进OnFailure,急急急,大神们
ProgressDownloader类 ``` public class ProgressDownloader { public static final String TAG = "TestProgressDownloader"; private ProgressResponseBody.ProgressListener progressListener; private String url; private OkHttpClient client; private File destination; private Call call; public ProgressDownloader(String url, File destination, ProgressResponseBody.ProgressListener progressListener) { this.url = url; this.destination = destination; this.progressListener = progressListener; //在下载、暂停后的继续下载中可复用同一个client对象 client = getProgressClient(); } //每次下载需要新建新的Call对象 private Call newCall(long startPoints) { Request request = new Request.Builder() .get() .url(url) .header("RANGE", "bytes=" + startPoints + "-")//断点续传要用到的,指示下载的区间 .build(); return client.newCall(request); } public OkHttpClient getProgressClient() { // 拦截器,用上ProgressResponseBody Interceptor interceptor = new Interceptor() { @Override public Response intercept(Chain chain) throws IOException { Response originalResponse = chain.proceed(chain.request()); return originalResponse.newBuilder() .body(new ProgressResponseBody(originalResponse.body(), progressListener)) .build(); } }; return new OkHttpClient.Builder() .addNetworkInterceptor(interceptor) .build(); } //startsPoint指定开始下载的点 public void download(final long startsPoint) { call = newCall(startsPoint); call.enqueue(new Callback() { @Override public void onFailure(Call call, IOException e) { Log.e("=======================","fail"); } @Override public void onResponse(Call call, Response response) throws IOException { Log.e("=======================","pass"); } }); } public void pause() { if(call!=null){ call.cancel(); } } private void save(Response response, long startsPoint) { ResponseBody body = response.body(); InputStream in = body.byteStream(); FileChannel channelOut = null; // 随机访问文件,可以指定断点续传的起始位置 RandomAccessFile randomAccessFile = null; try { randomAccessFile = new RandomAccessFile(destination, "rwd"); //Chanel NIO中的用法,由于RandomAccessFile没有使用缓存策略,直接使用会使得下载速度变慢,亲测缓存下载3.3秒的文件,用普通的RandomAccessFile需要20多秒。 channelOut = randomAccessFile.getChannel(); // 内存映射,直接使用RandomAccessFile,是用其seek方法指定下载的起始位置,使用缓存下载,在这里指定下载位置。 MappedByteBuffer mappedBuffer = channelOut.map(FileChannel.MapMode.READ_WRITE, startsPoint, body.contentLength()); byte[] buffer = new byte[1024]; int len; while ((len = in.read(buffer)) != -1) { mappedBuffer.put(buffer, 0, len); } } catch (IOException e) { e.printStackTrace(); }finally { try { in.close(); if (channelOut != null) { channelOut.close(); } if (randomAccessFile != null) { randomAccessFile.close(); } } catch (IOException e) { e.printStackTrace(); } } } } ``` MainActivity类 ``` /** * 1.添加依赖 * 2.生成带进度监听的ProgressResponseBody * 3.创建ProgressDownloader * 4.清单文件中添加网络权限和文件访问权限 */ public class MainActivity extends AppCompatActivity implements ProgressResponseBody.ProgressListener{ public static final String TAG = "MainActivity"; public static final String PACKAGE_URL = "http://gdown.baidu.com/data/wisegame/df65a597122796a4/weixin_821.apk"; @Bind(R.id.progressBar) ProgressBar progressBar; private long breakPoints; private ProgressDownloader downloader; private File file; private long totalBytes; private long contentLength; @Override protected void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); setContentView(R.layout.activity_main); ButterKnife.bind(this); } @OnClick({R.id.downloadButton, R.id.cancel_button, R.id.continue_button}) public void onClick(View view) { switch (view.getId()) { case R.id.downloadButton: // 新下载前清空断点信息 breakPoints = 0L; file = new 
File(Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS), "sample.apk"); downloader = new ProgressDownloader(PACKAGE_URL, file, this); downloader.download(0L); break; case R.id.cancel_button: downloader.pause(); Toast.makeText(this, "下载暂停", Toast.LENGTH_SHORT).show(); // 存储此时的totalBytes,即断点位置。 breakPoints = totalBytes; break; case R.id.continue_button: downloader.download(breakPoints); break; } } @Override public void onPreExecute(long contentLength) { // 文件总长只需记录一次,要注意断点续传后的contentLength只是剩余部分的长度 if (this.contentLength == 0L) { this.contentLength = contentLength; progressBar.setMax((int) (contentLength / 1024)); } } @Override public void update(long totalBytes, boolean done) { // 注意加上断点的长度 this.totalBytes = totalBytes + breakPoints; progressBar.setProgress((int) (totalBytes + breakPoints) / 1024); if (done) { // 切换到主线程 Observable .empty() .observeOn(AndroidSchedulers.mainThread()) .doOnCompleted(new Action0() { @Override public void call() { Toast.makeText(MainActivity.this, "下载完成", Toast.LENGTH_SHORT).show(); } }) .subscribe(); } } } ``` 代码链接:https://blog.csdn.net/halaoda/article/details/78502693
为什么我用scrapy爬取谷歌应用市场却爬取不到内容?
我想用scrapy爬取谷歌应用市场,代码没有报错,但是却爬取不到内容,这是为什么? ``` # -*- coding: utf-8 -*- import scrapy # from scrapy.spiders import CrawlSpider, Rule # from scrapy.linkextractors import LinkExtractor from gp.items import GpItem # from html.parser import HTMLParser as SGMLParser import requests class GoogleSpider(scrapy.Spider): name = 'google' allowed_domains = ['https://play.google.com/'] start_urls = ['https://play.google.com/store/apps/'] ''' rules = [ Rule(LinkExtractor(allow=("https://play\.google\.com/store/apps/details",)), callback='parse_app', follow=True), ] ''' def parse(self, response): selector = scrapy.Selector(response) urls = selector.xpath('//a[@class="LkLjZd ScJHi U8Ww7d xjAeve nMZKrb id-track-click"]/@href').extract() link_flag = 0 links = [] for link in urls: links.append(link) for each in urls: yield scrapy.Request(links[link_flag], callback=self.parse_next, dont_filter=True) link_flag += 1 def parse_next(self, response): selector = scrapy.Selector(response) app_urls = selector.xpath('//div[@class="details"]/a[@class="title"]/@href').extract() print(app_urls) urls = [] for url in app_urls: url = "http://play.google.com" + url print(url) urls.append(url) link_flag = 0 for each in app_urls: yield scrapy.Request(urls[link_flag], callback=self.parse_app, dont_filter=True) link_flag += 1 def parse_app(self, response): item = GpItem() item['app_url'] = response.url item['app_name'] = response.xpath('//div[@itemprop="name"]').xpath('text()').extract() item['app_icon'] = response.xpath('//img[@itempro="image"]/@src') item['app_developer'] = response.xpath('//') print(response.text) yield item ``` terminal运行信息如下: ``` BettyMacbookPro-764:gp zhanjinyang$ scrapy crawl google 2019-11-12 08:46:45 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: gp) 2019-11-12 08:46:45 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.1 (default, Dec 14 2018, 13:28:58) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.4.2, Platform Darwin-18.5.0-x86_64-i386-64bit 2019-11-12 08:46:45 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'gp', 'NEWSPIDER_MODULE': 'gp.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['gp.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'} 2019-11-12 08:46:45 [scrapy.extensions.telnet] INFO: Telnet Password: b2d7dedf1f4a91eb 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats'] 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-11-12 08:46:45 [scrapy.middleware] INFO: Enabled item pipelines: ['gp.pipelines.GpPipeline'] 2019-11-12 08:46:45 [scrapy.core.engine] INFO: Spider opened 2019-11-12 08:46:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2019-11-12 08:46:45 [py.warnings] WARNING: /anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://play.google.com/ in allowed_domains. warnings.warn(message, URLWarning) 2019-11-12 08:46:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-11-12 08:46:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/robots.txt> (referer: None) 2019-11-12 08:46:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/store/apps/> (referer: None) 2019-11-12 08:46:46 [scrapy.core.engine] INFO: Closing spider (finished) 2019-11-12 08:46:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 810, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 232419, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 11, 12, 8, 46, 46, 474543), 'log_count/DEBUG': 2, 'log_count/INFO': 9, 'log_count/WARNING': 1, 'memusage/max': 58175488, 'memusage/startup': 58175488, 'response_received_count': 2, 'robotstxt/request_count': 1, 'robotstxt/response_count': 1, 'robotstxt/response_status_count/200': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2019, 11, 12, 8, 46, 45, 562775)} 2019-11-12 08:46:46 [scrapy.core.engine] INFO: Spider closed (finished) ``` 求助!!!
选定的 Parcel 正在下载并安装在群集的所有主机上失败
CM6安装CDH6的时候到了这一步出现了这个问题![图片说明](https://img-ask.csdn.net/upload/201910/31/1572506575_776628.png) 下图是/var/log/cloudera-scm-agent/cloudera-scm-agent.log的报错信息 ![图片说明](https://img-ask.csdn.net/upload/201910/31/1572506709_116880.png) Traceback (most recent call last): File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/downloader.py", line 502, in callable callback(url, curr_op) File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/parcel_cache.py", line 200, in cb raise e Exception: Src file /opt/cloudera/parcels/.flood/CDH-6.1.0-1.cdh6.1.0.p0.770702-el7.parcel/CDH-6.1.0-1.cdh6.1.0.p0.770702-el7.parcel does not exist
请问scrapy为什么会爬取失败
C:\Users\Administrator\Desktop\新建文件夹\xiaozhu>python -m scrapy crawl xiaozhu 2019-10-26 11:43:11 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: xiaozhu) 2019-10-26 11:43:11 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9 .5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.5.3 (v 3.5.3:1880cb95a742, Jan 16 2017, 15:51:26) [MSC v.1900 32 bit (Intel)], pyOpenSS L 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-7-6.1 .7601-SP1 2019-10-26 11:43:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'xi aozhu', 'SPIDER_MODULES': ['xiaozhu.spiders'], 'NEWSPIDER_MODULE': 'xiaozhu.spid ers'} 2019-10-26 11:43:11 [scrapy.extensions.telnet] INFO: Telnet Password: c61bda45d6 3b8138 2019-10-26 11:43:11 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats'] 2019-10-26 11:43:12 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2019-10-26 11:43:12 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2019-10-26 11:43:12 [scrapy.middleware] INFO: Enabled item pipelines: [] 2019-10-26 11:43:12 [scrapy.core.engine] INFO: Spider opened 2019-10-26 11:43:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pag es/min), scraped 0 items (at 0 items/min) 2019-10-26 11:43:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2019-10-26 11:43:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting ( 307) to <GET https://bizverify.xiaozhu.com?slideRedirect=https%3A%2F%2Fbj.xiaozh u.com%2Ffangzi%2F125535477903.html> from <GET http://bj.xiaozhu.com/fangzi/12553 5477903.html> 2019-10-26 11:43:12 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://bizve rify.xiaozhu.com?slideRedirect=https%3A%2F%2Fbj.xiaozhu.com%2Ffangzi%2F125535477 903.html> (referer: None) 2019-10-26 11:43:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://bizverify.xiaozhu.com?slideRedirect=https%3A%2F%2Fbj.xiaozhu.com%2 Ffangzi%2F125535477903.html>: HTTP status code is not handled or not allowed 2019-10-26 11:43:12 [scrapy.core.engine] INFO: Closing spider (finished) 2019-10-26 11:43:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 529, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 725, 'downloader/response_count': 2, 'downloader/response_status_count/307': 1, 'downloader/response_status_count/400': 1, 'elapsed_time_seconds': 0.427734, 'finish_reason': 
'finished', 'finish_time': datetime.datetime(2019, 10, 26, 3, 43, 12, 889648), 'httperror/response_ignored_count': 1, 'httperror/response_ignored_status_count/400': 1, 'log_count/DEBUG': 2, 'log_count/INFO': 11, 'response_received_count': 1, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2019, 10, 26, 3, 43, 12, 461914)} 2019-10-26 11:43:12 [scrapy.core.engine] INFO: Spider closed (finished)
如何解决拉勾网302问题? 求大牛指导
最近在抓取拉勾网招聘信息的过程中 抓取一段时间后 会出现302重定向 ![图片说明](https://img-ask.csdn.net/upload/201910/16/1571190330_190767.png) 检查后发现被重定向至登录页面 ![图片说明](https://img-ask.csdn.net/upload/201910/16/1571190672_952436.png) 本以为完美解决 但结果并没有这么简单,登录后还是会出现302问题 求大神帮忙解惑!! settings配置如下: ``` BOT_NAME = 'LagouSpider' SPIDER_MODULES = ['LagouSpider.spiders'] NEWSPIDER_MODULE = 'LagouSpider.spiders' ROBOTSTXT_OBEY = False CONCURRENT_REQUESTS = 2 DOWNLOAD_DELAY = 3 #禁止重定向 COOKIES_ENABLED = False REDIRECT_ENABLED = False AUTOTHROTTLE_ENABLED = True AUTOTHROTTLE_START_DELAY = 2 DEFAULT_REQUEST_HEADERS = { 'Accept': 'application/json, text/javascript, */*; q=0.01', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.8', 'Connection': 'keep-alive', 'Host': 'www.lagou.com', 'Origin': 'https://www.lagou.com', 'Referer': 'https://www.lagou.com/', } DOWNLOADER_MIDDLEWARES = { # 'LagouSpider.middlewares.LagouspiderDownloaderMiddleware': 543, 'LagouSpider.middlewares.RandomUserAgentMiddleware' : 100, 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' : None, 'LagouSpider.middlewares.LagoucrawlerDownloaderMiddleware' : 543, } ```
重新composer 后,使用是出现超时错误?
今天重装了composer,重装后使用时出现超时错误如下 ``` qi@qi-ideacentre-AIO-300-23ISU:/var/www/html$ composer create-project --prefer-dist laravel/laravel blog [Composer\Downloader\TransportException] The "https://repo.packagist.org/packages.json" file could not be downloaded : failed to open stream: Connection timed out ``` 使用过composer中国镜像,然而报错信息与之前一模一样。 运行环境如下 ubuntu 16.04.1 php 7.1 composer 1.9 composer config如下 ``` [repositories.packagist.org.type] composer [repositories.packagist.org.url] https://packagist.phpcomposer.com [process-timeout] 300 [use-include-path] false [preferred-install] auto [notify-on-install] true [github-protocols] [https, ssh] [vendor-dir] vendor (/var/www/html/vendor) [bin-dir] {$vendor-dir}/bin (/var/www/html/vendor/bin) [cache-dir] /home/qi/.cache/composer [data-dir] /home/qi/.local/share/composer [cache-files-dir] {$cache-dir}/files (/home/qi/.cache/composer/files) [cache-repo-dir] {$cache-dir}/repo (/home/qi/.cache/composer/repo) [cache-vcs-dir] {$cache-dir}/vcs (/home/qi/.cache/composer/vcs) [cache-ttl] 15552000 [cache-files-ttl] 15552000 [cache-files-maxsize] 300MiB (314572800) [bin-compat] auto [discard-changes] false [autoloader-suffix] [sort-packages] false [optimize-autoloader] false [classmap-authoritative] false [apcu-autoloader] false [prepend-autoloader] true [github-domains] [github.com] [bitbucket-expose-hostname] true [disable-tls] false [secure-http] true [cafile] [capath] [github-expose-hostname] true [gitlab-domains] [gitlab.com] [store-auths] prompt [archive-format] tar [archive-dir] . [htaccess-protect] true [use-github-api] true [home] /home/qi/.config/composer ``` 附加信息:报错信息中的https://repo.packagist.org/packages.json在我的浏览器上是能打开的,域名也能ping通 ``` qi@qi-ideacentre-AIO-300-23ISU:/var/www/html$ ping repo.packagist.org PING repo.packagist.org (54.38.136.239) 56(84) bytes of data. 64 bytes from ip-54-38-136.eu (54.38.136.239): icmp_seq=1 ttl=39 time=362 ms 64 bytes from ip-54-38-136.eu (54.38.136.239): icmp_seq=2 ttl=39 time=371 ms 64 bytes from ip-54-38-136.eu (54.38.136.239): icmp_seq=3 ttl=39 time=391 ms 64 bytes from ip-54-38-136.eu (54.38.136.239): icmp_seq=4 ttl=39 time=359 ms ^C --- repo.packagist.org ping statistics --- 5 packets transmitted, 4 received, 20% packet loss, time 4002ms rtt min/avg/max/mdev = 359.543/371.305/391.974/12.752 ms ```
.net core spider【爬虫】如何进行点击页面某个控件再进行获取数据?
.net core spider【爬虫】如何进行点击页面某个控件再进行获取数据? 最好能用如下代码实例进行改造说明一下或其它方式(processor里实现) ``` Spider spider = Spider.Create(site, // use memoery queue scheduler. 使用内存调度 new QueueDuplicateRemovedScheduler(), // use custmize processor for youku 为优酷自定义的 Processor new YoukuPageProcessor()) // use custmize pipeline for youku 为优酷自定义的 Pipeline .AddPipeline(new YoukuPipeline()); spider.Downloader = new HttpClientDownloader(); spider.ThreadNum = 1; spider.EmptySleepTime = 3000; // Start crawler 启动爬虫 spider.Run(); ```
有没有懂python scrapy代理ip的老哥?
一个困扰我好几天的问题:用scrapy写的一个访问58同城的简易爬虫,在中间件里爬了很多有效的代理IP,但是在process____request方法里,代理IP不知道为什么就是不切换,一直使用的是最初成功的那个IP,明明打印的信息是已经更换了新的IP,实际访问的结果来看却还是没有更换。。。 -----这是控制台的打印: ![图片说明](https://img-ask.csdn.net/upload/201909/27/1569590380_292339.png) 这是爬虫文件:xicispider.py name = 'xicispider' allowed_domains = ['58.com'] start_urls = ['https://www.58.com/'] def parse(self, response): reg = r'<title>(.*?)</title>' print(re.search(reg,response.text).group()) yield scrapy.Request(url='https://www.58.com',callback=self.parsep, dont_filter=True) def parsep(self, response): reg = r'<title>(.*?)</title>' print(re.search(reg,response.text).group()) 这是中间件:middleware.py def process_request(self,spider,request): ip = random.choice(self.proxies) print("process_request方法运行了,重新获取的ip是:--------->",ip) request.meta['proxy'] = ip 这是settings.py里的有关配置: DOWNLOADER_MIDDLEWARES = { 'xici.middlewares.XiciDM': 543, }
Android Picasso 请求图片时添加referer的问题
服务器端要做图片防盗链,app端请求图片时需要带上特定的referer,通过以下代码,时不时可以成功,但是服务器端回复说有时候收不到referer。 Interceptor代码: public class PicassoHeaderInterceptor implements Interceptor { @Override public Response intercept(Chain chain) throws IOException { Request.Builder request = chain.request().newBuilder(); request.addHeader("referer", ConstantsValues.IMAGE_REFER); return chain.proceed(request.build()); } } getPicasso方法: public static Picasso getPicasso(){ if(mPicasso==null){ synchronized (UI.class){ if(mPicasso==null){ OkHttpClient okHttpClient = new OkHttpClient(); okHttpClient.interceptors().add(new PicassoHeaderInterceptor()); OkHttpDownloader okHttpDownloader = new OkHttpDownloader(okHttpClient); mPicasso = new Picasso.Builder(getContext()).downloader(okHttpDownloader).build(); } } } return mPicasso; } 加载图片的方法: public static void displayCircleImage(ImageView iv,String url){ getPicasso().with(getContext()).load(url) .placeholder(R.drawable.ic_image_loading) .error(R.drawable.ic_launcher) .transform(new CircleTransform()) .into(iv); } public static void displayImage(ImageView iv,String url){ getPicasso().with( iv.getContext() ) .load(url) .placeholder(R.drawable.ic_image_loading) .error(R.drawable.ic_launcher) .config(Bitmap.Config.RGB_565) .transform(new ZoomTransformation(UI.dip2px(200))) .into(iv); } 本来前段时间刚完成referer的时候是都可以的,但是这2天只有偶尔几张图片能够获取到,百思不得其解,希望各位能够帮帮忙
使用WebDriver中的click操作无法关闭天猫弹出的登陆界面
1.老师留的作业是用scrapy爬动态网页天猫商品的价格,但是用Chrome每次点开网页的时候都会弹出登录界面,虽然不影响爬取价格,但是想把这个页面关闭 网页:https://detail.tmall.com/item.htm?id=555358967936 2.代码: ``` def process_request(self, request, spider): # Called for each request that goes through the downloader # middleware. driver = spider.drive driver.get(request.url) # driver.switch_to.frame("sufei-dialog-content") #因为网页需要时间渲染,在这里确定目标元素 locator = (By.XPATH, '//span[@class="tm-price"]') close_btn = (By.XPATH,'//div[@class="sufei-dialog-content"]/div[@id="sufei-dialog-close"]') # driver.switch_to.frame("sufei-dialog-content") WebDriverWait(driver, 3,1).until(EC.presence_of_element_located(close_btn)) # driver.switch_to.frame("sufei-dialog-content") click = driver.find_element_by_xpath('//div[@class="sufei-dialog-close"]') actionchain = action_chains.ActionChains(driver) actionchain.click(click) actionchain.perform() print('点击已结束') driver.switch_to.default_content() # driver.switch_to.parent_frame() #等待网页渲染,最多等待15s,并且每1s查看一次是否出现目标元素 WebDriverWait(driver, 15, 1).until(EC.presence_of_element_located(locator)) # Must either: # - return None: continue processing this request # - or return a Response object # - or return a Request object # - or raise IgnoreRequest: process_exception() methods of # installed downloader middleware will be called #返回请求网页后得到的源代码 return HtmlResponse(url=request.url,body=driver.page_source,request=request,encoding='utf-8',status=200) ``` _3.我尝试过分析可能是iframe的问题,但是尝试过后总是提醒 selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//div[@class="sufei-dialog-close"]"} (Session info: chrome=75.0.3770.80) ![图片说明](https://img-ask.csdn.net/upload/201908/09/1565339432_653568.jpg) 蓝色的就是想要关闭的标签 感谢帮助(●'◡'●)
用anaconda的scrapy爬取数据,按照步骤设置好了,却爬不到数据,求助大神救救菜鸟
这是运行的全部结果: (D:\Anaconda2) C:\Users\luyue>cd C:\Users\luyue\movie250 (D:\Anaconda2) C:\Users\luyue\movie250>scrapy crawl movie250 -o items.json 2017-05-12 19:24:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: movie250) 2017-05-12 19:24:26 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'movie250.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['movie250.spiders'], 'BOT_NAME': 'movie250', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'} 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats'] 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-05-12 19:24:26 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-05-12 19:24:26 [scrapy.core.engine] INFO: Spider opened 2017-05-12 19:24:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-05-12 19:24:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-05-12 19:24:26 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://movie.douban.com/robots.txt> (referer: None) 2017-05-12 19:24:26 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://movie.douban.com/top250/> (referer: None) 2017-05-12 19:24:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://movie.douban.com/top250/>: HTTP status code is not handled or not allowed 2017-05-12 19:24:27 [scrapy.core.engine] INFO: Closing spider (finished) 2017-05-12 19:24:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 445, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 496, 'downloader/response_count': 2, 'downloader/response_status_count/403': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 5, 12, 11, 24, 27, 13000), 'log_count/DEBUG': 3, 'log_count/INFO': 8, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2017, 5, 12, 11, 24, 26, 675000)} 2017-05-12 19:24:27 [scrapy.core.engine] INFO: Spider closed (finished)
scrapy配置问题,求大家帮忙啊
配置scrapy 我是按照http://blog.csdn.net/wukaibo1986/article/details/8167590配置的 创建项目可以 但是运行项目的时候报错,做的demo是按照 http://www.oschina.net/translate/scrapy-demo做的 求解释: E:\爬虫\tutorial>scrapy crawl dmoz 2013-11-20 11:09:50+0800 [scrapy] INFO: Scrapy 0.20.0 started (bot: tutorial) 2013-11-20 11:09:50+0800 [scrapy] DEBUG: Optional features available: ssl, http11 2013-11-20 11:09:50+0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'} 2013-11-20 11:09:50+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState Traceback (most recent call last): File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name) File "C:\Python27\lib\runpy.py", line 72, in _run_code exec code in run_globals File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 168, in <module> execute() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 143, in execute _run_print_help(parser, _run_command, cmd, args, opts) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help func(*a, **kw) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command cmd.run(args, opts) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\commands\crawl.py", line 50, in run self.crawler_process.start() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 92, in start if self.start_crawling(): File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 124, in start_crawling return self._start_crawler() is not None File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 139, in _start_crawler crawler.configure() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\crawler.py", line 47, in configure self.engine = ExecutionEngine(self, self._spider_closed) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\engine.py", line 63, in __init__ self.downloader = Downloader(crawler) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\__init__.py", line 73, in __init__ self.handlers = DownloadHandlers(crawler) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 18, in __init__ cls = load_object(clspath) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\utils\misc.py", line 40, in load_object mod = import_module(module) File "C:\Python27\lib\importlib\__init__.py", line 37, in import_module __import__(name) File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\s3.py", line 4, in <module> from .http import HTTPDownloadHandler File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\http.py", line 5, in <module> from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\core\downloader\handlers\http11.py", line 17, in <module> from scrapy.responsetypes import responsetypes File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\responsetypes.py", line 113, in <module> responsetypes = ResponseTypes() File "C:\Python27\lib\site-packages\scrapy-0.20.0-py2.7.egg\scrapy\responsetypes.py", line 34, in __init__ self.mimetypes = MimeTypes() File 
"C:\Python27\lib\mimetypes.py", line 66, in __init__ init() File "C:\Python27\lib\mimetypes.py", line 358, in init db.read_windows_registry() File "C:\Python27\lib\mimetypes.py", line 258, in read_windows_registry for subkeyname in enum_types(hkcr): File "C:\Python27\lib\mimetypes.py", line 249, in enum_types ctype = ctype.encode(default_encoding) # omit in 3.x! UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 9: ordinal not in range(128)
python代码苍穹平台数据抓取
原文地址:https://github.com/yiyuezhuo/cangqiong-scratch http://v.kuaidadi.com/ 在上面这个网站平台抓取数据,为什么只有10个城市的数据可以抓取数据,其他的就不行呢?原文说10个城市可以抓取,但是我觉得应该通用的,知道区号不就可以获取相应的数据了吗? 代码如下: ``` # -*- coding: utf-8 -*- """ Created on Thu Mar 17 12:15:08 2016 @author: yiyuezhuo """ ''' cityId:510100 scope:city date:3 dimension:satisfy num:300 ''' import requests import json import pandas as pd import os def get(cityId='510100',scope='city',date='3',dimension='satisfy',num=1000): url='http://v.kuaidadi.com/point' params={'cityId':cityId,'scope':scope,'date':date,'dimension':dimension,'num':num} res=requests.get(url,params=params) print (res.content) return json.loads(res.content.decode()) class Downloader(object): def __init__(self,cityId_list='441300'): self.cityId_list=cityId_list if cityId_list!=None else ['510100'] self.scope_list=['city'] self.date_list=[str(i) for i in range(7)] self.dimension_list=['distribute','satisfy','demand','response','money'] # money好像get字段不太一样,不过暂且用一样的方法请求 self.num_list=[1000] self.pkey=('cityId','scope','date','dimension','num') self.data={} def keys(self): for cityId in self.cityId_list: for scope in self.scope_list: for date in self.date_list: for dimension in self.dimension_list: for num in self.num_list: yield (cityId,scope,date,dimension,num) def download(self,verbose=True): for key in self.keys(): pkey=self.pkey params=dict(zip(pkey,key)) self.data[key]=get(**params) if verbose: print('clear',key) def to_csv(key,json_d,prefix='data/'): data=json_d['result']['data'] city_id=json_d['result']['cityID'] date=json_d['result']['date'] dimension=key[3] fname='_'.join([dimension,date,city_id,'.csv']) fname=fname.replace('/','.') fname=prefix+fname cdata=[] for hour,section in enumerate(data): for record in section: cdata.append([hour]+record[1:]) df=pd.DataFrame(cdata,columns=['hour','longitude','latitude','value']) df.to_csv(fname) def to_csv_all(datas,path='data/'): for key,json_d in datas.items(): to_csv(key,json_d,prefix=path) def run(city,path='data'): if not os.path.isdir(path): print('create dir path',path) os.mkdir(path) downloader=Downloader([city]) downloader.download() to_csv_all(downloader.data,path=path+'/') def CLI(): import argparse parser = argparse.ArgumentParser(usage=u'python main.py 510100', description=u"苍穹平台数据抓取器") parser.add_argument('city',help=u'城市序号,成都是510100,其他ID参见cityId.json文件') parser.add_argument('--dir',default='data',help=u'保存路径,默认为data') args=parser.parse_args() run(args.city,args.dir) if __name__=='__main__': import sys if len(sys.argv)>1: CLI() ''' downloader=Downloader() downloader.download() to_csv_all(downloader.data) ''' ```
python笔趣阁报错:SyntaxError: invalid syntax
自己在论坛上面找了一份python3爬虫的代码,但是比照着写就出现了上面的问题,求助大家帮我看一下。 import requests from bs4 import BeautifulSoup """ 说明:下载《笔趣阁》小说《一念永恒》 parameter: 无 Return: 无 Modify: 2019-06-27 """ class downloader(object): def _init_(self): self.server='https://www.biqukan.com/' self.url='https://www.biqukan.com/1_1094/' self.name=[] self.urls=[] self.nums=0 """ 函数说明:获取下载链接 Parameters: 无 Returns: 无 Modify: 2019-06-27 """ def get_download_url(self): resp = requests.get(url) html=resp.text resp.encoding=resp.apparent_encoding if html: with open('test.html',mode='a+',encoding=resp.apparent_encoding) as file: file.write(html) div_bf = BeautifulSoup(html) div=div_bf.find_all('div', class_ = 'listmain') a_bf = BeautifulSoup(str(div[0])) a = a_bf.find_all('a') self.nums=len(a[15:]) for each in a[15:]: self.names.append(each.string) self.urls.append(self.server+each.get('href') """ 函数说明:获取章节内容 Parameters: url - 下载连接(string) Returns: texts - 章节内容(string) Modify: 2019-6-27 """ def get_contents(self, url): req = requests.get(url) html = resp.text bf = BeautifulSoup(html) texts = bf.find_all('div', class_ = 'showtxt') texts = texts[0].text.replace('\xa0'*8,'\n\n') return texts """ 函数说明:将爬取的文章内容写入文件 Parameters: name - 章节名称(string) path - 当前路径下,小说保存名称(string) text - 章节内容(string) Returns: 无 Modify: 2019-06-27 """ def writer(self, name, path, text): write_flag = True with open(path, 'a', encoding='utf-8') as f: f.write(name + '\n') f.writelines(text) f.write('\n\n') dl = downloader() dl.get_download_url() print('《一年永恒》开始下载:') for i in range(dl.nums): dl.writer(dl.names[i], '一念永恒.txt', dl.get_contents(dl.urls[i])) sys.stdout.write("已下载:%.3f%%" % float(i/dl.nums) + '\r') sys.stdout.flush() print('《一年永恒》下载完成')
请问python中调用类的方法怎么调用
请问在类里怎么调用类的函数,我想把canshu里返回的数据在order里打印出来,比如我这么写 class DailishiyanDownloaderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. def canshu(self):#数据库返回数据在这个函数 aa=["192.168.1.2","11.22.33","44,55,66"] return aa b=canshu(1) print("我在函数外",b) def order(self):#将返回数据按顺序输出 print("我来自order") for i in range(10): yield i a=order(1) @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s ........ 就能正常输出,但我如果这么写,就会报错AttributeError: 'int' object has no attribute 'canshu' class DailishiyanDownloaderMiddleware(object): # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. def canshu(self):#数据库返回数据在这个函数 aa=["192.168.1.2","11.22.33","44,55,66"] return aa b=canshu(1) #print("我在函数外",b) def order(self):#将返回数控按顺序输出Ss print("我来自order",self.canshu()) for i in range(10): yield i a=order(1) @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s ...... 省略的部分是scrapy框架的默认代码,基本没有更改 毫无头绪的bug,求帮助,感谢
scrapy 运行抛出NotImplementedError,请问一般什么原因造成呢?
/usr/bin/python3.5 /home/pzs/PycharmProjects/News/main.py 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: News) 2017-04-08 11:00:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'News', 'SPIDER_MODULES': ['News.spiders'], 'NEWSPIDER_MODULE': 'News.spiders'} 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.telnet.TelnetConsole',  'scrapy.extensions.corestats.CoreStats',  'scrapy.extensions.logstats.LogStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',  'scrapy.downloadermiddlewares.retry.RetryMiddleware',  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',  'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',  'scrapy.spidermiddlewares.referer.RefererMiddleware',  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',  'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled item pipelines: ['News.pipelines.MysqlPipeline'] 2017-04-08 11:00:12 [scrapy.core.engine] INFO: Spider opened 2017-04-08 11:00:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-04-08 11:00:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2017-04-08 11:00:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://18.92.0.1/contents/7/121174.html> (referer: None) 2017-04-08 11:00:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://18.92.0.1/contents/7/121174.html> (referer: None) Traceback (most recent call last):   File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks     current.result = callback(current.result, *args, **kw)   File "/usr/local/lib/python3.5/dist-packages/scrapy/spiders/__init__.py", line 76, in parse     raise NotImplementedError NotImplementedError 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Closing spider (finished) 2017-04-08 11:00:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 229,  'downloader/request_count': 1,  'downloader/request_method_count/GET': 1,  'downloader/response_bytes': 16609,  'downloader/response_count': 1,  'downloader/response_status_count/200': 1,  'finish_reason': 'finished',  'finish_time': datetime.datetime(2017, 4, 8, 18, 0, 13, 938637),  'log_count/DEBUG': 2,  'log_count/ERROR': 1,  'log_count/INFO': 7,  'response_received_count': 1,  'scheduler/dequeued': 1,  'scheduler/dequeued/memory': 1,  'scheduler/enqueued': 1,  'scheduler/enqueued/memory': 1,  'spider_exceptions/NotImplementedError': 1,  'start_time': datetime.datetime(2017, 4, 8, 18, 0, 12, 917719)} 2017-04-08 11:00:13 [scrapy.core.engine] INFO: Spider closed (finished) Process finished with exit code 0 直接运行会弹出NotImplementedError错误,单步调试也看不出到底哪里出了问题
Very simple Scrapy code, but I just cannot figure out where it goes wrong. Please take a look!
News_spider file:

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy import Selector
from News.items import NewsItem


class NewsSpiderSpider(scrapy.Spider):
    name = "news_spider"
    allowed_domains = ["http://18.92.0.1"]
    start_urls = ['http://18.92.0.1/contents/7/121174.html']

    def parse_detail(self, response):
        sel = Selector(response)
        items = []
        item = NewsItem()

        item['title'] = sel.css('.div_bt::text').extract()[0]

        characters = sel.css('.div_zz::text').extract()[0].replace("\xa0", "")
        pattern = re.compile('[:].*[ ]')
        result = pattern.search(characters)
        item['post'] = result.group().replace(":", "").strip()
        pattern = re.compile('[ ][^发]*')
        result = pattern.search(characters)
        item['approver'] = result.group()
        pattern = re.compile('[201].{9}')
        result = pattern.search(characters)
        item['date_of_publication'] = result.group()
        pattern = re.compile('([0-9]+)$')
        result = pattern.search(characters)
        item['browse_times'] = result.group()

        content = sel.css('.xwnr').extract()[0]
        pattern = re.compile('[\u4e00-\u9fa5]|[,、。“”]')
        result = pattern.findall(content)
        item['content'] = ''.join(result).replace("仿宋", " ").replace("宋体", " ").replace("楷体", " ")

        item['img1_url'] = sel.xpath('//*[@id="newpic"]/div[1]/div[1]/img/@src').extract()[0]
        item['img1_name'] = sel.xpath('//*[@id="newpic"]/div[1]/div[2]/text()').extract()[0]
        item['img2_url'] = sel.xpath('//*[@id="newpic"]/div[2]/div[1]/img/@src').extract()[0]
        item['img2_name'] = sel.xpath('//*[@id="newpic"]/div[2]/div[2]').extract()[0]
        item['img3_url'] = sel.xpath('//*[@id="newpic"]/div[3]/div[1]/img/@src').extract()[0]
        item['img3_name'] = sel.xpath('//*[@id="newpic"]/div[3]/div[2]/text()').extract()[0]
        item['img4_url'] = sel.xpath('//*[@id="newpic"]/div[4]/div[1]/img/@src').extract()[0]
        item['img4_name'] = sel.xpath('//*[@id="newpic"]/div[4]/div[2]/text()').extract()[0]
        item['img5_url'] = sel.xpath('//*[@id="newpic"]/div[5]/div[1]/img/@src').extract()[0]
        item['img5_name'] = sel.xpath('//*[@id="newpic"]/div[5]/div[2]/text()').extract()[0]
        item['img6_url'] = sel.xpath('//*[@id="newpic"]/div[6]/div[1]/img/@src').extract()[0]
        item['img6_name'] = sel.xpath('//*[@id="newpic"]/div[6]/div[2]/text()').extract()[0]

        characters = sel.xpath('/html/body/div/div[2]/div[4]/div[4]/text()').extract()[0].replace("\xa0", "")
        pattern = re.compile('[:].*?[ ]')
        result = pattern.search(characters)
        item['company'] = result.group().replace(":", "").strip()
        pattern = re.compile('[ ][^联]*')
        result = pattern.search(characters)
        item['writer_photography'] = result.group()
        pattern = re.compile('(([0-9]|[-])+)$')
        result = pattern.search(characters)
        item['tel'] = result.group()

        items.append(item)
        return items

items file:

import scrapy


class NewsItem(scrapy.Item):
    title = scrapy.Field()
    post = scrapy.Field()
    approver = scrapy.Field()
    date_of_publication = scrapy.Field()
    browse_times = scrapy.Field()
    content = scrapy.Field()
    img1_url = scrapy.Field()
    img1_name = scrapy.Field()
    img2_url = scrapy.Field()
    img2_name = scrapy.Field()
    img3_url = scrapy.Field()
    img3_name = scrapy.Field()
    img4_url = scrapy.Field()
    img4_name = scrapy.Field()
    img5_url = scrapy.Field()
    img5_name = scrapy.Field()
    img6_url = scrapy.Field()
    img6_name = scrapy.Field()
    company = scrapy.Field()
    writer_photography = scrapy.Field()
    tel = scrapy.Field()

pipelines file:

import MySQLdb
import MySQLdb.cursors


class NewsPipeline(object):
    def process_item(self, item, spider):
        return item


class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('192.168.254.129', 'root', 'root', 'news', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = "insert into news_table(title,post,approver,date_of_publication,browse_times,content,img1_url,img1_name,img2_url,img2_name,img3_url,img3_name,img4_url,img4_name,img5_url,img5_name,img6_url,img6_name,company,writer_photography,tel) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
        self.cursor.execute(insert_sql, (item['title'], item['post'], item['approver'], item['date_of_publication'], item['browse_times'], item['content'], item['img1_url'], item['img1_name'], item['img1_url'], item['img1_name'], item['img2_url'], item['img2_name'], item['img3_url'], item['img3_name'], item['img4_url'], item['img4_name'], item['img5_url'], item['img5_name'], item['img6_url'], item['img6_name'], item['company'], item['writer_photography'], item['tel']))
        self.conn.commit()

settings file:

BOT_NAME = 'News'
SPIDER_MODULES = ['News.spiders']
NEWSPIDER_MODULE = 'News.spiders'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = True
ITEM_PIPELINES = {
    # 'News.pipelines.NewsPipeline': 300,
    'News.pipelines.MysqlPipeline': 300,
}

Console output:

/usr/bin/python3.5 /home/pzs/PycharmProjects/News/main.py
2017-04-08 11:00:12 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: News)
2017-04-08 11:00:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'News', 'SPIDER_MODULES': ['News.spiders'], 'NEWSPIDER_MODULE': 'News.spiders'}
2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-08 11:00:12 [scrapy.middleware] INFO: Enabled item pipelines:
['News.pipelines.MysqlPipeline']
2017-04-08 11:00:12 [scrapy.core.engine] INFO: Spider opened
2017-04-08 11:00:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-08 11:00:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-08 11:00:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://18.92.0.1/contents/7/121174.html> (referer: None)
2017-04-08 11:00:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://18.92.0.1/contents/7/121174.html> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-04-08 11:00:13 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-08 11:00:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 229,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 16609,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 8, 18, 0, 13, 938637),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2017, 4, 8, 18, 0, 12, 917719)}
2017-04-08 11:00:13 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

Running it directly raises the NotImplementedError shown above, and single-stepping in the debugger does not reveal where the problem is either.
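A hedged reading of this specific traceback: NewsSpiderSpider defines parse_detail() but never creates a Request pointing at it, so the response for the start URL is handed to the inherited scrapy.Spider.parse(), which only raises NotImplementedError. Renaming parse_detail to parse, or issuing the initial request with an explicit callback, should get past this error. Two smaller observations on the posted code worth checking afterwards: allowed_domains is normally a list of bare domains such as "18.92.0.1" rather than full URLs, and the execute() tuple in MysqlPipeline appears to pass item['img1_url'] and item['img1_name'] twice, so the number of values will not match the placeholders once the pipeline actually runs. A sketch of the callback fix, assuming the rest of the project stays as posted:

```
import scrapy
from News.items import NewsItem  # project layout as in the question


class NewsSpiderSpider(scrapy.Spider):
    name = "news_spider"
    allowed_domains = ["18.92.0.1"]
    start_urls = ['http://18.92.0.1/contents/7/121174.html']

    # Option 1: route the initial requests to the custom callback explicitly.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_detail)

    # Option 2 (alternative): simply rename this method to parse().
    def parse_detail(self, response):
        item = NewsItem()
        item['title'] = response.css('.div_bt::text').extract_first()
        # ... the rest of the extraction logic from the question goes here ...
        yield item
```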
Cloudera Manager offline installation: while installing the agent, downloading resources from the master node times out.
Error log:

[19/Nov/2018 16:16:04 +0000] 2789 MainThread stacks_collection_manager INFO Using max_uncompressed_file_size_bytes: 5242880
[19/Nov/2018 16:16:04 +0000] 2789 MainThread __init__ INFO Importing metric schema from file /opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/monitor/schema.json
[19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Supervised processes will add the following to their environment (in addition to the supervisor's env): {'CDH_PARQUET_HOME': '/usr/lib/parquet', 'JSVC_HOME': '/usr/libexec/bigtop-utils', 'CMF_PACKAGE_DIR': '/opt/cloudera-manager/cm-5.10.2/lib64/cmf/service', 'CDH_HADOOP_BIN': '/usr/bin/hadoop', 'MGMT_HOME': '/opt/cloudera-manager/cm-5.10.2/share/cmf', 'CDH_IMPALA_HOME': '/usr/lib/impala', 'CDH_YARN_HOME': '/usr/lib/hadoop-yarn', 'CDH_HDFS_HOME': '/usr/lib/hadoop-hdfs', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'CDH_HUE_PLUGINS_HOME': '/usr/lib/hadoop', 'CM_STATUS_CODES': u'STATUS_NONE HDFS_DFS_DIR_NOT_EMPTY HBASE_TABLE_DISABLED HBASE_TABLE_ENABLED JOBTRACKER_IN_STANDBY_MODE YARN_RM_IN_STANDBY_MODE', 'KEYTRUSTEE_KP_HOME': '/usr/share/keytrustee-keyprovider', 'CLOUDERA_ORACLE_CONNECTOR_JAR': '/usr/share/java/oracle-connector-java.jar', 'CDH_SQOOP2_HOME': '/usr/lib/sqoop2', 'KEYTRUSTEE_SERVER_HOME': '/usr/lib/keytrustee-server', 'CDH_MR2_HOME': '/usr/lib/hadoop-mapreduce', 'HIVE_DEFAULT_XML': '/etc/hive/conf.dist/hive-default.xml', 'CLOUDERA_POSTGRESQL_JDBC_JAR': '/opt/cloudera-manager/cm-5.10.2/share/cmf/lib/postgresql-9.0-801.jdbc4.jar', 'CDH_KMS_HOME': '/usr/lib/hadoop-kms', 'CDH_HBASE_HOME': '/usr/lib/hbase', 'CDH_SQOOP_HOME': '/usr/lib/sqoop', 'WEBHCAT_DEFAULT_XML': '/etc/hive-webhcat/conf.dist/webhcat-default.xml', 'CDH_OOZIE_HOME': '/usr/lib/oozie', 'CDH_ZOOKEEPER_HOME': '/usr/lib/zookeeper', 'CDH_HUE_HOME': '/usr/lib/hue', 'CLOUDERA_MYSQL_CONNECTOR_JAR': '/usr/share/java/mysql-connector-java.jar', 'CDH_HBASE_INDEXER_HOME': '/usr/lib/hbase-solr', 'CDH_MR1_HOME': '/usr/lib/hadoop-0.20-mapreduce', 'CDH_SOLR_HOME': '/usr/lib/solr', 'CDH_PIG_HOME': '/usr/lib/pig', 'CDH_SENTRY_HOME': '/usr/lib/sentry', 'CDH_CRUNCH_HOME': '/usr/lib/crunch', 'CDH_LLAMA_HOME': '/usr/lib/llama/', 'CDH_HTTPFS_HOME': '/usr/lib/hadoop-httpfs', 'ROOT': '/opt/cloudera-manager/cm-5.10.2/lib64/cmf', 'CDH_HADOOP_HOME': '/usr/lib/hadoop', 'CDH_HIVE_HOME': '/usr/lib/hive', 'ORACLE_HOME': '/usr/share/oracle/instantclient', 'CDH_HCAT_HOME': '/usr/lib/hive-hcatalog', 'CDH_KAFKA_HOME': '/usr/lib/kafka', 'CDH_SPARK_HOME': '/usr/lib/spark', 'TOMCAT_HOME': '/usr/lib/bigtop-tomcat', 'CDH_FLUME_HOME': '/usr/lib/flume-ng'}
[19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO To override these variables, use /etc/cloudera-scm-agent/config.ini. Environment variables for CDH locations are not used when CDH is installed from parcels.
[19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Created /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/process [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Chmod'ing /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/process to 0751 [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Created /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/supervisor [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Chmod'ing /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/supervisor to 0751 [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Created /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/flood [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Chowning /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/flood to cloudera-scm (498) cloudera-scm (498) [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Chmod'ing /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/flood to 0751 [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Created /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/supervisor/include [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent INFO Chmod'ing /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/supervisor/include to 0751 [19/Nov/2018 16:16:04 +0000] 2789 MainThread agent ERROR Failed to connect to previous supervisor. Traceback (most recent call last): File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/agent.py", line 2073, in find_or_start_supervisor self.configure_supervisor_clients() File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/agent.py", line 2254, in configure_supervisor_clients supervisor_options.realize(args=["-c", os.path.join(self.supervisor_dir, "supervisord.conf")]) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 1599, in realize Options.realize(self, *arg, **kw) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 333, in realize self.process_config() File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 341, in process_config self.process_config_file(do_usage) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 376, in process_config_file self.usage(str(msg)) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 164, in usage self.exit(2) SystemExit: 2 [19/Nov/2018 16:16:04 +0000] 2789 MainThread tmpfs INFO Successfully mounted tmpfs at /opt/cloudera-manager/cm-5.10.2/run/cloudera-scm-agent/process [19/Nov/2018 16:16:05 +0000] 2789 MainThread agent INFO Trying to connect to newly launched supervisor (Attempt 1) [19/Nov/2018 16:16:05 +0000] 2789 MainThread agent INFO Supervisor version: 3.0, pid: 2821 [19/Nov/2018 16:16:05 +0000] 2789 MainThread agent INFO Successfully connected to supervisor [19/Nov/2018 16:16:05 +0000] 2789 MainThread status_server INFO Using maximum impala profile bundle size of 1073741824 bytes. [19/Nov/2018 16:16:05 +0000] 2789 MainThread status_server INFO Using maximum stacks log bundle size of 1073741824 bytes. 
[19/Nov/2018 16:16:05 +0000] 2789 MainThread _cplogging INFO [19/Nov/2018:16:16:05] ENGINE Bus STARTING [19/Nov/2018 16:16:05 +0000] 2789 MainThread _cplogging INFO [19/Nov/2018:16:16:05] ENGINE Started monitor thread '_TimeoutMonitor'. [19/Nov/2018 16:16:06 +0000] 2789 MainThread _cplogging INFO [19/Nov/2018:16:16:06] ENGINE Serving on yingzhi01.com:9000 [19/Nov/2018 16:16:06 +0000] 2789 MainThread _cplogging INFO [19/Nov/2018:16:16:06] ENGINE Bus STARTED [19/Nov/2018 16:16:06 +0000] 2789 MainThread __init__ INFO New monitor: (<cmf.monitor.host.HostMonitor object at 0x2990c50>,) [19/Nov/2018 16:16:06 +0000] 2789 MonitorDaemon-Scheduler __init__ INFO Monitor ready to report: ('HostMonitor',) [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Setting default socket timeout to 30 [19/Nov/2018 16:16:06 +0000] 2789 Monitor-HostMonitor network_interfaces INFO NIC iface eth0 doesn't support ETHTOOL (95) [19/Nov/2018 16:16:06 +0000] 2789 Monitor-HostMonitor throttling_logger ERROR Error getting directory attributes for /opt/cloudera-manager/cm-5.10.2/log/cloudera-scm-agent Traceback (most recent call last): File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/monitor/dir_monitor.py", line 90, in _get_directory_attributes name = pwd.getpwuid(uid)[0] KeyError: 'getpwuid(): uid not found: 1106' [19/Nov/2018 16:16:06 +0000] 2789 MainThread heartbeat_tracker INFO HB stats (seconds): num:1 LIFE_MIN:0.22 min:0.22 mean:0.22 max:0.22 LIFE_MAX:0.22 [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO CM server guid: dceeafae-a884-42f1-ba7b-4ee187ef3bef [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Using parcels directory from server provided value: /opt/cloudera/parcels [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent WARNING Expected user root for /opt/cloudera/parcels but was cloudera-scm [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent WARNING Expected group root for /opt/cloudera/parcels but was cloudera-scm [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Created /opt/cloudera/parcel-cache [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Chowning /opt/cloudera/parcel-cache to root (0) root (0) [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Chmod'ing /opt/cloudera/parcel-cache to 0755 [19/Nov/2018 16:16:06 +0000] 2789 MainThread parcel INFO Agent does create users/groups and apply file permissions [19/Nov/2018 16:16:06 +0000] 2789 MainThread downloader INFO Downloader path: /opt/cloudera/parcel-cache [19/Nov/2018 16:16:06 +0000] 2789 MainThread parcel_cache INFO Using /opt/cloudera/parcel-cache for parcel cache [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Flood daemon (re)start attempt [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Created /opt/cloudera/parcels/.flood [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Chowning /opt/cloudera/parcels/.flood to cloudera-scm (498) cloudera-scm (498) [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Chmod'ing /opt/cloudera/parcels/.flood to 0755 [19/Nov/2018 16:16:06 +0000] 2789 MainThread agent INFO Triggering supervisord update. [19/Nov/2018 16:16:36 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:16:36 +0000] 2789 MainThread agent INFO Active parcel list updated; recalculating component info. [19/Nov/2018 16:16:36 +0000] 2789 MainThread throttling_logger WARNING CMF_AGENT_JAVA_HOME environment variable host override will be deprecated in future. 
JAVA_HOME setting configured from CM server takes precedence over host agent override. Configure JAVA_HOME setting from CM server. [19/Nov/2018 16:16:36 +0000] 2789 MainThread throttling_logger INFO Identified java component java8 with full version JAVA_HOME=/opt/modules/jdk1.8.0_144 java version "1.8.0_144" Java(TM) SE Runtime Environment (build 1.8.0_144-b01) Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) for requested version . [19/Nov/2018 16:16:36 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.6659779549 [19/Nov/2018 16:16:36 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:16:44 +0000] 2789 Monitor-HostMonitor throttling_logger ERROR Timeout with args ['ntpdc', '-np'] None [19/Nov/2018 16:16:44 +0000] 2789 Monitor-HostMonitor throttling_logger ERROR Failed to collect NTP metrics Traceback (most recent call last): File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/monitor/host/ntp_monitor.py", line 48, in collect self.collect_ntpd() File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/monitor/host/ntp_monitor.py", line 66, in collect_ntpd result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/monitor/host/ntp_monitor.py", line 38, in _subprocess_with_timeout return subprocess_with_timeout(args, timeout) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/subprocess_timeout.py", line 94, in subprocess_with_timeout raise Exception("timeout with args %s" % args) Exception: timeout with args ['ntpdc', '-np'] [19/Nov/2018 16:17:06 +0000] 2789 DnsResolutionMonitor throttling_logger INFO Using java location: '/opt/modules/jdk1.8.0_144/bin/java'. 
[19/Nov/2018 16:17:06 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:17:06 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1082139015 [19/Nov/2018 16:17:06 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:17:36 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:17:36 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1235852242 [19/Nov/2018 16:17:36 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:18:07 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:18:07 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1040799618 [19/Nov/2018 16:18:07 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:18:37 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:18:37 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1849529743 [19/Nov/2018 16:18:37 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:19:07 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:19:07 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1211960316 [19/Nov/2018 16:19:07 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:19:37 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:19:37 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1215620041 [19/Nov/2018 16:19:37 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:20:01 +0000] 2789 CP Server Thread-4 _cplogging INFO 192.168.164.35 - - [19/Nov/2018:16:20:01] "GET /heartbeat HTTP/1.1" 200 2 "" "NING/1.0" [19/Nov/2018 16:20:04 +0000] 2789 CP Server Thread-5 _cplogging INFO 192.168.164.35 - - [19/Nov/2018:16:20:04] "GET /heartbeat HTTP/1.1" 200 2 "" "NING/1.0" [19/Nov/2018 16:20:07 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:20:07 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1212861538 [19/Nov/2018 16:20:07 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:20:37 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:20:37 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1753029823 [19/Nov/2018 16:20:37 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:20:37 +0000] 2789 Thread-13 downloader INFO Fetching torrent: http://yingzhi01.com:7180/cmf/parcel/download/CDH-5.10.2-1.cdh5.10.2.p0.5-el6.parcel.torrent [19/Nov/2018 16:20:37 +0000] 2789 Thread-13 downloader INFO Starting download of: http://yingzhi01.com:7180/cmf/parcel/download/CDH-5.10.2-1.cdh5.10.2.p0.5-el6.parcel [19/Nov/2018 16:21:07 +0000] 2789 Thread-13 downloader ERROR Unexpected exception during download Traceback (most recent call last): File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/cmf/downloader.py", line 279, in download self.client.AddTorrent(torrent_url) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/flood/util/cmd.py", line 159, in __call__ return self.fn.__get__(self.binding)(*args, **kwargs) File 
"/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/flood/util/rpc.py", line 68, in <lambda> return lambda *pargs, **kwargs: self._invoke(*pargs, **kwargs) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/flood/util/rpc.py", line 77, in _invoke return rpcClient.requestor.request(self.schema.name, msg) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/flood/util/rpc.py", line 129, in requestor return avro.ipc.Requestor(self.SCHEMA, self.transceiver) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/cmf-5.10.2-py2.6.egg/flood/util/rpc.py", line 125, in transceiver return avro.ipc.HTTPTransceiver(self.server.host, self.server.port) File "/opt/cloudera-manager/cm-5.10.2/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 469, in __init__ self.conn.connect() File "/usr/lib64/python2.6/httplib.py", line 771, in connect self.timeout) File "/usr/lib64/python2.6/socket.py", line 567, in create_connection raise error, msg timeout: timed out [19/Nov/2018 16:21:07 +0000] 2789 Thread-13 downloader INFO Finished download [ url: http://yingzhi01.com:7180/cmf/parcel/download/CDH-5.10.2-1.cdh5.10.2.p0.5-el6.parcel, state: exception, total_bytes: 0, downloaded_bytes: 0, start_time: 2018-11-19 16:20:37, download_end_time: , end_time: 2018-11-19 16:21:07, code: 600, exception_msg: timed out, path: None ] [19/Nov/2018 16:21:07 +0000] 2789 MainThread downloader ERROR Failed rack peer update: timed out [19/Nov/2018 16:21:07 +0000] 2789 MainThread agent WARNING Long HB processing time: 30.1247620583 [19/Nov/2018 16:21:07 +0000] 2789 MainThread agent WARNING Delayed HB: 15s since last [19/Nov/2018 16:21:07 +0000] 2789 Thread-13 downloader INFO Fetching torrent: http://yingzhi01.com:7180/cmf/parcel/download/CDH-5.10.2-1.cdh5.10.2.p0.5-el6.parcel.torrent [19/Nov/2018 16:21:08 +0000] 2789 Thread-13 downloader INFO Starting download of: http://yingzhi01.com:7180/cmf/parcel/download/CDH-5.10.2-1.cdh5.10.2.p0.5-el6.parcel [19/Nov/2018 16:21:38 +0000] 2789 Thread-13 downloader ERROR Unexpected exception during download 然后就是不断重复超时错误求大神指点。。。