scrapy Python: FileNotFoundError: [WinError 2] The system cannot find the file specified

I recently wrote a spider that downloads images. It runs fine while I am debugging, but after it has been running for a while it crashes with the error above and stops.

old_path = image_store + ''.join(image_paths)
new_path = image_store + image_paths[0].split('\\')[0] + '\\' + item['imagePath']
print("new path is " + new_path)
if not os.path.exists(new_path):
    os.mkdir(new_path)

os.rename(old_path, new_path + '\\' + image_paths[0].split('\\')[1])

return item

All I am doing here is renaming the downloaded file and moving it into a specific folder. Why does it keep failing after running for a while?
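For context, a snippet like this usually lives in the item_completed() hook of an ImagesPipeline subclass. The surrounding names below (image_store, image_paths, the RenameImagesPipeline class) are assumptions reconstructed from the snippet, not code from the original post. A minimal sketch of that assumed context:

```python
import os
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

image_store = 'D:\\images\\'  # hypothetical IMAGES_STORE value

class RenameImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples; keep successful downloads only
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('no image downloaded for %s' % item)
        old_path = image_store + ''.join(image_paths)
        # ... rename logic from the question goes here ...
        return item
```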

1 answer

There are probably two causes: either the file name of a crawled image is malformed, or, because of that bad file name, the save path you build from it is not a valid Windows path.
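If the first cause is the culprit, one way to rule it out is to strip the characters Windows forbids in file names before building the path. This is a sketch added for illustration (the helper name sanitize_filename is made up, not part of the original answer):

```python
import re

def sanitize_filename(name):
    # Windows rejects \ / : * ? " < > | in file names, plus trailing dots and spaces
    name = re.sub(r'[\\/:*?"<>|]', '_', name)
    return name.rstrip('. ') or 'unnamed'

# e.g. sanitize_filename('my?pic:1.jpg') returns 'my_pic_1.jpg'
```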

Wrap the file operations in try/except.

old_path = image_store + ''.join(image_paths)
new_path = image_store + image_paths[0].split('\\')[0] + '\\' + item['imagePath']
print("new path is " + new_path)

try:
    if not os.path.exists(new_path):
        os.mkdir(new_path)
    os.rename(old_path, new_path + '\\' + image_paths[0].split('\\')[1])
except OSError:
    # Swallow the error so one bad item does not kill the spider;
    # consider logging it instead of passing silently.
    pass

return item
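On Windows, FileNotFoundError: [WinError 2] from os.rename typically means that either old_path no longer exists (the image failed to download, was filtered as a duplicate, or was already moved) or the destination directory is missing. Beyond the try/except above, a more defensive version of the same move step can make the failure visible instead of silent. This is only a sketch under the same assumed variable names; the helper move_image is made up for illustration:

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)

def move_image(image_store, image_paths, item):
    # Build paths with os.path.join instead of string concatenation
    old_path = os.path.join(image_store, *image_paths[0].split('\\'))
    new_dir = os.path.join(image_store, image_paths[0].split('\\')[0], item['imagePath'])
    new_path = os.path.join(new_dir, image_paths[0].split('\\')[-1])

    if not os.path.exists(old_path):
        # Download failed or the file was already moved: log and skip instead of crashing
        logger.warning('source image missing, skipping: %s', old_path)
        return

    os.makedirs(new_dir, exist_ok=True)  # also creates intermediate directories
    shutil.move(old_path, new_path)      # unlike os.rename, works across drives
```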