How to keep header field names lowercase when sending a scrapy Request in Python 3

When sending a request with scrapy.Request in Python 3, Scrapy automatically normalizes the header field names: the first letter is capitalized, and so is the first letter after an underscore, hyphen, or similar separator. The problem: I need to send the server a header whose name is all lowercase, but after Scrapy processes the request the name comes out capitalized, and the server simply does not recognize it. Does anyone know whether scrapy.Request has a way to leave headers untouched? When I make the same request with the requests library instead of scrapy.Request, the headers go out unchanged.

(Screenshot: the headers before the request is sent.)

(Screenshot: the request headers captured by the packet sniffer.)

1 answer

In the spider file, above the spider class, register the headers you do not want capitalized:

```python
from twisted.web.http_headers import Headers as TwistedHeaders

# Twisted canonicalizes header names on the wire unless the name is in
# _caseMappings; mapping each lowercase name to itself stops the title-casing.
TwistedHeaders._caseMappings.update({
    b'uuid': b'uuid',
    b'content-type': b'content-type',
})
```
answered by qq_29581601

Comment from 窃语: Nice — tested it, and it works. (replied about a year ago)
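For context, here is a minimal sketch of where that snippet lands in a spider module. The spider name, URL, and header value are placeholders, and note that _caseMappings is a private Twisted attribute, so this hack may behave differently across Twisted versions:

```python
# myspider.py — a minimal sketch; 'uuid' stands in for whatever
# lowercase header name the server insists on.
import scrapy
from twisted.web.http_headers import Headers as TwistedHeaders

# Registering the lowercase spelling as its own canonical form keeps
# Twisted from rewriting it to 'Uuid' when the request hits the wire.
TwistedHeaders._caseMappings.update({b'uuid': b'uuid'})

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/api',      # placeholder URL
            headers={'uuid': 'abc-123'},    # sent as 'uuid', not 'Uuid'
            callback=self.parse,
        )

    def parse(self, response):
        pass
```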
Other related questions
Scrapy: DEFAULT_REQUEST_HEADERS is already set in settings.py — how should headers be written when issuing a request?

With DEFAULT_REQUEST_HEADERS already configured in settings.py, how should the headers argument be written when sending an individual request?
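A short sketch of how the two interact (spider name, URL, and header value are placeholders): DefaultHeadersMiddleware only fills in defaults the request does not already carry, so per-request headers win on conflicts and you only need to pass the ones that differ:

```python
import scrapy

class PageSpider(scrapy.Spider):   # hypothetical spider
    name = 'page_demo'

    def start_requests(self):
        # DEFAULT_REQUEST_HEADERS from settings.py still applies here;
        # this Referer overrides the default one for this request only.
        yield scrapy.Request(
            'https://example.com/page',                  # placeholder URL
            headers={'Referer': 'https://example.com'},  # placeholder value
            callback=self.parse,
        )

    def parse(self, response):
        pass
```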

How do I stop a Scrapy request from automatically carrying the Set-Cookie value of the previous response?

As shown in the picture, before the request enters the downloader there is no Cookie field in its headers, but once the download finishes, the request's headers contain a Cookie field whose value is the Set-Cookie content returned by the previous response. This particular request should not carry any cookies at all. I tried COOKIES_ENABLED = False in settings; request.headers then indeed has no Cookie and I get the request headers I want, but after that the requests that do need cookies can no longer work normally. How do I make just this one request not carry the previous response's information? ![图片说明](https://img-ask.csdn.net/upload/201904/24/1556070657_534485.png)
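Scrapy has a documented per-request switch for exactly this case; a minimal sketch follows (spider name and URL are placeholders). Setting the dont_merge_cookies meta key keeps the CookiesMiddleware from attaching stored cookies to that one request while leaving the cookie jar intact for everything else:

```python
import scrapy

class NoCookieSpider(scrapy.Spider):   # hypothetical spider
    name = 'no_cookie_demo'

    def start_requests(self):
        # Only this request skips the cookie jar; later requests keep using it.
        yield scrapy.Request(
            'https://example.com/no-cookie-endpoint',   # placeholder URL
            meta={'dont_merge_cookies': True},
            callback=self.parse,
        )

    def parse(self, response):
        pass
```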

How should the meta argument of Scrapy's FormRequest be set?

I'm crawling with Scrapy, and the parse callback hands off to a second-level callback. The code:

```python
item = SoccerDataItem()
for i in range(1, 8):
    item['player' + str(i + 1)] = players[i]
for j in range(1, 8):
    home_sub_list = response.xpath('//div[@class="left"]//li[@class="pl10"]')
    if home_sub_list[j - 1].xpath('./span/img[contains(@src,"subs_up")]'):
        item['player' + str(j)]['name'] = home_sub_list[j - 1].xpath('./div[@class="ml10"]').xpath('string(.)').re_first('\d{1,2}\xa0\xa0(.*)')
        item['player' + str(j)]['team_stand'] = 1
        item['player' + str(j)]['is_startup'] = 0
        item['player' + str(j)]['is_subs_up'] = 1
        item['player' + str(j)]['subs_up_time'] = home_sub_list[j].xpath('./span/img[contains(@src,"subs_up")]/following-sibling::span').xpath('string(.)').extract_first(default='')
        yield scrapy.FormRequest(url=data_site, formdata=formdata, meta={'player': item['player' + str(j)]}, callback=self.parse_data)
    else:
        item['player' + str(j)]['name'] = home_sub_list[j - 1].xpath('./div[@class="ml10"]').xpath('string(.)').re_first('\d{1,2}\xa0\xa0(.*)')
        item['player' + str(j)]['team_stand'] = 1
        item['player' + str(j)]['is_startup'] = 0
        item['player' + str(j)]['is_subs_up'] = 0
```

But running it keeps failing with:

```
  callback=self.parse_data)
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 66, in _urlencode
    for k, vs in seq
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\http\request\form.py", line 67, in <listcomp>
    for v in (vs if is_listlike(vs) else [vs])]
  File "c:\users\pc1\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\utils\python.py", line 119, in to_bytes
    'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got int
```

From what I could find on Baidu, the values of meta's key/value pairs should be strings or bytes, which would explain the error when I pass in a dict. How should I change this? PS: I'm writing in Python; apologies if the formatting is hard on the eyes!
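Worth noting (my reading of the traceback, not part of the original thread): the failure is raised while url-encoding formdata, not while handling meta — to_bytes receives an int from one of the form values, whereas meta happily holds arbitrary objects. A minimal sketch of the usual fix, stringifying every form value before building the request (formdata is whatever dict the spider already built):

```python
# Form values must be str/bytes; an int anywhere in formdata
# triggers the to_bytes TypeError seen in the traceback.
formdata = {k: str(v) for k, v in formdata.items()}
yield scrapy.FormRequest(
    url=data_site,
    formdata=formdata,
    meta={'player': item['player' + str(j)]},  # meta can hold any object
    callback=self.parse_data,
)
```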

Scrapy reports "Missing scheme in request url" when running the crawler

I'm a complete Scrapy beginner playing with a tutorial project from http://blog.csdn.net/czl389/article/details/77278166 that scrapes hip-hop lyrics. It has three spiders: songurls, lyrics, and songinfo. The songurls spider successfully scrapes URLs from Xiami Music into SongUrls.csv, but the lyrics spider fails with this log:

```
D:\xiami2\xiami2>scrapy crawl lyrics -o Lyrics.csv
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiami2)
2017-10-21 21:13:29 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiami2.spiders', 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 4.0; Trident/3.0)', 'FEED_URI': 'Lyrics.csv', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.2, 'SPIDER_MODULES': ['xiami2.spiders'], 'BOT_NAME': 'xiami2'}
2017-10-21 21:13:29 [scrapy.middleware] INFO: Enabled extensions: [...]
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled downloader middlewares: [...]
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled spider middlewares: [...]
2017-10-21 21:13:31 [scrapy.middleware] INFO: Enabled item pipelines: ['xiami2.pipelines.Xiami2Pipeline']
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider opened
2017-10-21 21:13:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-21 21:13:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-21 21:13:31 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "d:\python3.5\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "d:\python3.5\lib\site-packages\scrapy\spiders\__init__.py", line 83, in start_requests
    yield Request(url, dont_filter=True)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "d:\python3.5\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-21 21:13:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 567323), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'start_time': datetime.datetime(2017, 10, 21, 13, 13, 31, 536236)}
2017-10-21 21:13:31 [scrapy.core.engine] INFO: Spider closed (finished)
```
I looked into scrapy's request __init__.py and found:

```python
if ':' not in self._url:
    raise ValueError('Missing scheme in request url: %s' % self._url)
```

I've tried the fixes I found online and none of them solved it, so I'm asking here (genuinely out of C-coins). This is one of my first questions, so please forgive any formatting lapses. The code follows.

```python
# songurls.py
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from ..items import SongUrlItem

class SongurlsSpider(scrapy.Spider):
    name = 'songurls'
    allowed_domains = ['xiami.com']

    # put page/1 through page/401 into start_urls to cover every page
    start_url_list = []
    url_fixed = 'http://www.xiami.com/song/tag/Hip-Hop/page/'
    for i in range(1, 402):
        start_url_list.extend([url_fixed + str(i)])
    start_urls = start_url_list

    def parse(self, response):
        urls = response.xpath('//*[@id="wrapper"]/div[2]/div/div/div[2]/table/tbody/tr/td[2]/a[1]/@href').extract()
        for url in urls:
            song_url = response.urljoin(url)
            url_item = SongUrlItem()
            url_item['song_url'] = song_url
            yield url_item
```

```python
# lyrics.py
import scrapy
import re

class LyricsSpider(scrapy.Spider):
    name = 'lyrics'
    allowed_domains = ['xiami.com']
    song_url_file = 'SongUrls.csv'

    def __init__(self, *args, **kwargs):
        # read every song url out of SongUrls.csv
        f = open(self.song_url_file, "r")
        lines = f.readlines()
        # line[:-1] strips the trailing newline; lines[1:] skips the csv header row
        song_url_list = [line[:-1] for line in lines[1:]]
        f.close()
        while '\n' in song_url_list:
            song_url_list.remove('\n')
        self.start_urls = song_url_list  # [:100]  # drop [:100] to crawl everything

    def parse(self, response):
        lyric_lines = response.xpath('//*[@id="lrc"]/div[1]/text()').extract()
        lyric = ''
        for lyric_line in lyric_lines:
            lyric += lyric_line
        # print lyric
        lyricItem = LyricItem()
        lyricItem['lyric'] = lyric
        lyricItem['song_url'] = response.url
        yield lyricItem
```

songinfo isn't used yet, so it doesn't matter here.

```python
# items.py
import scrapy

class SongUrlItem(scrapy.Item):
    song_url = scrapy.Field()    # song link

class LyricItem(scrapy.Item):
    lyric = scrapy.Field()       # lyrics
    song_url = scrapy.Field()    # song link

class SongInfoItem(scrapy.Item):
    song_url = scrapy.Field()    # song link
    song_title = scrapy.Field()  # title
    album = scrapy.Field()       # album
    # singer = scrapy.Field()    # singer
    language = scrapy.Field()    # language
```

A few lines added in the middleware:

```python
sleep_seconds = 0.2        # sleep after the simulated click so the browser can respond
default_sleep_seconds = 1  # sleep for requests with no action

def process_request(self, request, spider):
    spider.logger.info('--------Spider request processed: %s' % spider.name)
    page = None
    driver = webdriver.PhantomJS()
    spider.logger.info('--------request.url: %s' % request.url)
    driver.get(request.url)
    driver.implicitly_wait(0.2)
    # sleep briefly, then return the loaded page
    time.sleep(self.sleep_seconds)
    page = driver.page_source
    driver.close()
    return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
```

Additions and changes in settings:

```python
from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

DOWNLOAD_DELAY = 0.2

DEFAULT_REQUEST_HEADERS = {
    'Host': 'www.xiami.com',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
}

ITEM_PIPELINES = {
    'xiami2.pipelines.Xiami2Pipeline': 300,
}
```
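One cause worth ruling out first (my suggestion, not from the thread): if SongUrls.csv contains a header row, blank lines, or stray whitespace, start_urls ends up with an entry that has no http:// scheme — which is exactly what raises the ValueError, and the empty string after "url:" in the log points that way. In particular, line[:-1] turns a blank line into '' while the cleanup loop only removes '\n'. A defensive sketch for lyrics.py's __init__:

```python
with open(self.song_url_file, 'r') as f:
    lines = f.readlines()
# keep only real absolute URLs; a header row, blank line, or stray
# whitespace would otherwise trigger "Missing scheme in request url"
self.start_urls = [line.strip() for line in lines if line.strip().startswith('http')]
```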

Why can I crawl this page directly with requests but not with scrapy?

```python
class job51():
    def __init__(self):
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
            'Cookie': ''
        }

    def start(self):
        html = session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php", headers=self.headers)
        self.parse(html)

    def parse(self, response):
        tree = lxml.etree.HTML(response.text)
        resume_url = tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
        print(resume_url[0])
```

This gets me what I want — the resume URL. But with scrapy and the exact same headers, the page seems to stay on the login page?

```python
class job51(Spider):
    name = "job51"
    # allowed_domains = ["my.51job.com"]
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
        'Cookie': ''
    }

    def start_requests(self):
        yield Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)

    def parse(self, response):
        # tree = lxml.etree.HTML(text)
        selector = Selector(response)
        print("<<<<<<<<<<<<<<<<<<<<<", response.text)
        resume_url = selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
        print(">>>>>>>>>>>>", resume_url)
```

The output:

```
[scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions: [...]
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares: [...]
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares: [...]
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
    yield Request(resume_url[0], headers=self.headers, callback=self.getResume)
  File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_count': 2, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'response_received_count': 2, 'spider_exceptions/IndexError': 1, ...}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
```
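Two details in that log stand out (my reading, not from the thread): the 200 response body is a JavaScript redirect to login.51job.com, and Scrapy's CookiesMiddleware manages the Cookie header itself, so a Cookie set by hand in headers is typically discarded in favor of the (empty) cookie jar — unlike requests, which sends it verbatim. A sketch of passing the session cookie the way the middleware expects (the cookie name and value are placeholders for whatever the browser session actually uses):

```python
def start_requests(self):
    # Hand cookies to CookiesMiddleware via cookies=, instead of a raw
    # 'Cookie' header that the middleware will overwrite.
    cookies = {'51job_session': 'PLACEHOLDER'}   # hypothetical name/value
    yield Request(
        url=self.start_urls[0],
        headers=self.headers,
        cookies=cookies,
        callback=self.parse,
    )
```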

A site that requests can access but scrapy cannot

scrapy cannot reach a site that requests reaches just fine — has anyone run into and solved something similar?

Scrapy crawl stops with "Filtered duplicate"

When requesting the site http://bigfile.co.kr with scrapy, it prints a "Filtered duplicate request: no more duplicates" message and just finishes. After adding dont_filter=True and rerunning, it loops forever instead, never finishing and never scraping anything. Can any expert take a look?

```python
name = 'WebSpider'
start_urls = ['http://bigfile.co.kr']
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    'Referer': 'http://www.baidu.com/',
    "Upgrade-Insecure-Requests": 1,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}

def start_requests(self):
    request = scrapy.Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)
    request.meta['url'] = self.start_urls[0]
    yield request
```

Simulating Zhihu login with scrapy returns "Logging in too frequently, please try again later"

Simulating the login with requests worked before (the locally saved page kept auto-refreshing). After switching to scrapy, the login returns 200 together with "logging in too frequently, please try again later". I've been stuck on this for days — please help. Code:

```python
# -*- coding: utf-8 -*-
import scrapy
import json
import re

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    agent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"
    header = {
        "HOST": "www.zhihu.com",
        "REFERER": "https://www.zhihu.com",
        "User-Agent": agent,
        "Connection": "Keep-Alive"
    }

    def parse(self, response):
        pass

    def start_requests(self):
        return [scrapy.Request("https://www.zhihu.com/signin?next=/", headers=self.header, callback=self.login)]

    def login(self, response):
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)
        xsrf = ''
        if match_obj:
            xsrf = match_obj.group(1)
        if xsrf:
            post_url = "https://www.zhihu.com/login/phone_num"
            post_date = {
                "_xsrf": xsrf,
                "password": "13247161221",
                "phone_num": "yun10791023",
                'captcha_type': 'cn',
                'remember_me': 'true',
            }
            return [scrapy.FormRequest(
                url=post_url,
                formdata=post_date,
                headers=self.header,
                callback=self.checklogin
            )]

    def checklogin(self, response):
        # inspect the server's reply to decide whether login succeeded
        text_json = json.loads(response.text)
        # a breakpoint here shows: logging in too frequently, please retry later
        if "msg" in text_json and text_json["msg"] == "登陆成功":
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=False, headers=self.header)
```

Scrapy proxy IP is configured but crawling fails

In middlewares:

```python
class ProxyMiddleWare(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_passwd'] is None:
            # if 'user_passwd' not in proxy:
            # proxy without authentication
            print('---------------------->>> ', proxy['ip_port'])
            request.meta['proxy'] = "http://" + proxy['ip_port']
            # request.meta['proxy'] = 'http://122.235.168.162:8118'
        else:
            # base64-encode the credentials
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode())
            # the format the proxy server expects
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd.decode()
            request.meta['proxy'] = "http://" + proxy['ip_port']
```

In settings:

```python
PROXIES = [
    # {'ip_port': '61.175.192.2:35420'},
    # {'ip_port': '221.234.192.10:8010'},
    {'ip_port': '221.224.49.194:51127', 'user_passwd': ''},
    # {"ip_port": "121.41.8.23:16816", "user_passwd": "morganna_mode_g:ggc22qxp"},
    # {'ip_port': '122.224.249.122:8088', 'user_passwd': 'user4:pass4'},
]

DOWNLOADER_MIDDLEWARES = {
    # 'taobao.middlewares.TaobaoDownloaderMiddleware': 543,
    # 'taobao.middlewares.SeleniumMiddleware': 543,
    # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 751,
    'taobao.middlewares.ProxyMiddleWare': 750,
    'taobao.middlewares.RandomUserAgent': 400,
}
```

That's how I set it up. The IP itself works when tested with requests (status 200 comes back), but scrapy can't crawl data through it. Could any expert point out what's wrong?
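One detail that may explain it (my observation, not a confirmed diagnosis): the active PROXIES entry sets 'user_passwd' to '', and '' is not None, so the middleware takes the authenticated branch and sends an empty Proxy-Authorization header. A truthiness check inside process_request sidesteps that:

```python
# '' and None are both treated as "no credentials"
if not proxy.get('user_passwd'):
    request.meta['proxy'] = 'http://' + proxy['ip_port']
else:
    base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode())
    request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd.decode()
    request.meta['proxy'] = 'http://' + proxy['ip_port']
```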

scrapy returns garbled bytes when crawling the Zhihu homepage

Crawling the Zhihu homepage, response.text comes back garbled; trying to decode response.body by hand is garbled too, and I can't see why. Code:

```python
import scrapy

HEADERS = {
    'Host': 'www.zhihu.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Origin': 'https://www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=HEADERS)

    def parse(self, response):
        print('========== parse ==========')
        print(response.text[:100])
        body = response.body
        encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1']
        for encoding in encodings:
            try:
                print('========== decode ' + encoding)
                print(body.decode(encoding)[:100])
                print('========== decode end\n')
            except Exception as e:
                print('########## decode {0}, error: {1}\n'.format(encoding, e))
```

The log:

```
D:\workspace_python\ZhihuSpider>scrapy crawl zhihu
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider)
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']}
2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions: [...]
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares: [...]
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares: [...]
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened
2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: https://www.zhihu.com/)
========== parse ==========
(garbled bytes)
========== decode utf-8
########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte
========== decode gbk
########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode gb2312
########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode iso-8859-1
(mojibake)
```

The exact same code works fine against douban. I've searched everywhere without finding a fix, so I'm asking here. If I can't get this crawler working, my Zhihu-clone backend has no data to show — I'm really anxious. With fewer than 5 C-coins left I can't post a bounty, but I'd be grateful for any expert help.
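A likely cause (an assumption based on the headers, not a confirmed fix): Accept-Encoding advertises br, and Scrapy 1.4's HttpCompressionMiddleware cannot decompress Brotli, so the raw compressed bytes reach parse and no text codec can decode them — consistent with douban working, if douban simply doesn't serve Brotli. Advertising only codings Scrapy can actually decompress avoids the problem:

```python
HEADERS = {
    # ...
    'Accept-Encoding': 'gzip, deflate',  # drop 'br' so the response is decompressible
    # ...
}
```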

How to get the Location header of a redirect in Python?

The attachment link redirects; while scraping with Python I want to capture the download link it redirects to.
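A minimal sketch with requests (the URL is a placeholder): disable automatic redirect following and read the Location header off the 3xx response itself.

```python
import requests

resp = requests.get('https://example.com/attachment', allow_redirects=False)  # placeholder URL
if resp.is_redirect:                    # any 3xx carrying a Location header
    print(resp.headers['Location'])     # the post-redirect download link
```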

scrapy response parsed incompletely — part of the printed result is missing

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.conf import settings

class ContentSpider(scrapy.Spider):
    name = "content"
    allowed_domains = ["pkulaw.cn"]
    start_urls = (
        'http://www.pkulaw.cn/',
    )
    headers = settings.get('HEADERS')
    surl = 'http://www.pkulaw.cn/fulltext_form.aspx?Db=chl&Gid=58178&keyword=&EncodingName=&Search_Mode=accurate'

    def parse(self, response):
        yield scrapy.Request(url=self.surl,
                             headers=self.headers,
                             callback=self.parse_con)

    def parse_con(self, response):
        content = ''.join(response.xpath('.//*[@id="div_content"]').extract())
        self.logger.info("--content--:%s" % content)
```

![图片说明](https://img-ask.csdn.net/upload/201701/04/1483511185_293371.png)

The page itself contains:

```html
<p align='center'><font class='MTitle'>人才市场管理规定<br>
(2001年9月11日人事部、国家工商行政管理总局令第1号发布 2005年3月22日根据《人事部、国家工商行政管理总局关于修改<人才市场管理规定>的决定》修正 2005年3月22日人事部、国家工商行政管理总局令第4号发布)</font></p>
```

but the logged result is missing the 《人才市场管理规定》 block. Is there any way to fix this?

Python crawler: proxy has no effect with scrapy + scrapy-splash on Douban Books — can any expert take a look? Thanks!

Using scrapy with scrapy-splash to crawl Douban Books, the proxy setting has no effect: with the proxy configured, the fetched text still says login is required. The 'FirstSplash.middlewares.FirstsplashSpiderMiddleware': 823 entry in settings and the request.meta['splash']['args']['proxy'] assignment in FirstsplashSpiderMiddleware were copied from a web search; the proxy IP is a random one from the 西刺免费代理IP site. The text printed by scrapy crawl doubanBook says login is required. Run result: ![图片说明](https://img-ask.csdn.net/upload/201904/25/1556181491_319319.jpg)

Spider code:

```python
name = 'doubanBook'
category = ''

def start_requests(self):
    serachBook = ['python', 'scala', 'spark']
    for x in serachBook:
        self.category = x
        start_urls = ['https://book.douban.com/subject_search', ]
        url = start_urls[0] + "?search_text=" + x
        self.log("开始爬取:" + url)
        yield SplashRequest(url, self.parse_pre)

def parse_pre(self, response):
    print(response.text)
```

Middleware proxy configuration:

```python
class FirstsplashSpiderMiddleware(object):
    def process_request(self, request, spider):
        print("进入代理")
        print(request.meta['splash']['args']['proxy'])
        request.meta['splash']['args']['proxy'] = "'http://112.87.69.226:9999"
        print(request.meta['splash']['args']['proxy'])
```

settings configuration:

```python
BOT_NAME = 'FirstSplash'
SPIDER_MODULES = ['FirstSplash.spiders']
NEWSPIDER_MODULE = 'FirstSplash.spiders'
ROBOTSTXT_OBEY = False

# docker + scrapy-splash configuration
FEED_EXPORT_ENCODING = 'utf-8'
# splash service address
SPLASH_URL = 'http://127.0.0.1:8050'
# dedup filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# splash's own spider middleware config
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# splash's downloader middlewares plus the proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'FirstSplash.middlewares.FirstsplashSpiderMiddleware': 823,
}
# browser-like request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
```
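One thing that jumps out (an observation, not a verified fix): the proxy string carries a stray leading quote, so Splash is told to use 'http://112.87.69.226:9999 — quote included — as its proxy URL. It should be a clean URL:

```python
# no stray quote inside the string
request.meta['splash']['args']['proxy'] = 'http://112.87.69.226:9999'
```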

scrapy: with an error status code, why does the proxy only take effect for the first request?

Part of the downloader middleware:

```python
def process_response(self, request, response, spider):
    status_code = [403]
    if response.status in status_code:
        spider.logger.debug('Error ======= %s %s , switching to proxy' % (response.status, request.url))
        import importlib
        proxy = ProxyMiddleware(settings=settings)
        request.meta['proxy'] = proxy.proxy_server
        request.headers['Proxy-Authorization'] = proxy.proxy_authorization
        return request
    else:
        return response
```

In theory, every 403 response should go back out through the proxy until the status is no longer 403. What actually happens is shown in the run screenshot. (Screenshot not included.)
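A possible explanation (an assumption, not verified against the missing screenshot): a request returned from process_response is rescheduled and passes through the duplicate filter again, where the already-seen URL can be dropped, so only the first retry survives. Marking the retried request dont_filter would rule that out:

```python
# inside the 403 branch of process_response
retry_req = request.copy()
retry_req.dont_filter = True   # let the retry past the duplicate filter
retry_req.meta['proxy'] = proxy.proxy_server
retry_req.headers['Proxy-Authorization'] = proxy.proxy_authorization
return retry_req
```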

python3: scraped content prints fine but fails to write to a local file, even though the write statement itself tests OK

Python 3; the source is as follows:

```python
import os        # system utilities
import re        # regular expressions
import urllib
import urllib.request
import urllib.error
import urllib.parse
import json
import socket
import time

class ImoocSpider:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

    def getPythonInfo(self, keyWord):
        myKeyWord = urllib.parse.quote(keyWord)
        searchUrl = 'https://www.imooc.com/search/?words=' + myKeyWord
        try:
            request = urllib.request.Request(url=searchUrl, headers=self.headers)
            page = urllib.request.urlopen(request)
            rsp = page.read().decode('unicode_escape')
        except UnicodeDecodeError as e:
            print(e)
            print('-----UnicodeDecodeErrorurl:', searchUrl)
        except urllib.error.URLError as e:
            print(e)
            print("-----urlErrorurl:", searchUrl)
        except socket.timeout as e:
            print(e)
            print("-----socket timout:", searchUrl)
        else:
            myres = rsp
            self.saveFile(myres, keyWord)
        finally:
            page.close()
            print("get_finally")

    def saveFile(self, res, keyWord):
        b = "./" + keyWord + '.txt'
        if not os.path.exists(b):  # only write when the file doesn't exist yet
            try:
                fp = open(b, 'w')
                print(res, file=fp)  # print into the file
            except:
                print('文件写入有误')
            finally:
                fp.close()
                print('save_finally')

    def start(self, keyWord):
        self.getPythonInfo(keyWord)

if __name__ == '__main__':
    imoocInfo = ImoocSpider()
    imoocInfo.start('python')
```
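Two likely culprits, judging from the code alone (assumptions, not confirmed): open(b, 'w') uses the platform default encoding — GBK on Chinese Windows — which cannot encode all of the page and lands in the bare except; and the os.path.exists guard silently skips the write whenever the file already exists from an earlier run. A sketch of saveFile with both addressed:

```python
def saveFile(self, res, keyWord):
    path = './' + keyWord + '.txt'
    # explicit utf-8 avoids the default-GBK encode failure on Windows;
    # no exists() guard, so a stale file from a failed run can't block the write
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(res)
```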

scrapy stores to MySQL, but queries find no data

## 1. Problem description

I'm trying to crawl a site with the scrapy framework and store the scraped data in MySQL. The run finishes without any error, but when I query the data the table is empty. (The framework follows this blog post: https://www.cnblogs.com/fromlantianwei/p/10607956.html)

## 2. Screenshots

scrapy project: ![图片说明](https://img-ask.csdn.net/upload/202003/04/1583310103_446281.png) Database creation: ![图片说明](https://img-ask.csdn.net/upload/202003/04/1583310345_774265.png)

## 3. Code

(1) the tencent spider:

```python
# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
import re
from copy import deepcopy
from ScrapyPro3.items import ScrapyPro3Item

class tencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = []
    start_urls = [
        'http://tieba.baidu.com/mo/q----,sz@320_240-1-3---2/m?kw=%E6%A1%82%E6%9E%97%E7%94%B5%E5%AD%90%E7%A7%91%E6%8A%80%E5%A4%A7%E5%AD%A6%E5%8C%97%E6%B5%B7%E6%A0%A1%E5%8C%BA&pn=26140',
    ]

    def parse(self, response):
        # whole page
        item = ScrapyPro3Item()
        all_elements = response.xpath(".//div[@class='i']")
        # print(all_elements)
        for all_element in all_elements:
            content = all_element.xpath("./a/text()").extract_first()
            content = "".join(content.split())
            change = re.compile(r'[\d]+.')
            content = change.sub('', content)
            item['comment'] = content
            person = all_element.xpath("./p/text()").extract_first()
            person = "".join(person.split())
            # strip like/reply counts
            change2 = re.compile(r'点[\d]+回[\d]+')
            person = change2.sub('', person)
            # pick out the date
            change3 = re.compile(r'[\d]?[\d]?-[\d][\d](?=)')
            date = change3.findall(person)
            # if posted today, pick out the time instead
            change4 = re.compile(r'[\d]?[\d]?:[\d][\d](?=)')
            time = change4.findall(person)
            person = change3.sub('', person)
            person = change4.sub('', person)
            if time == []:
                item['time'] = date
            else:
                item['time'] = time
            item['name'] = person
            # extra fields: password, active flag
            item['is_active'] = '1'
            item['password'] = '123456'
            print(item)
            yield item
        # next page
        """next_url = 'http://tieba.baidu.com/mo/q----,sz@320_240-1-3---2/' + parse.unquote(
            response.xpath(".//div[@class='bc p']/a/@href").extract_first())
        print(next_url)
        yield scrapy.Request(
            next_url,
            callback=self.parse,
        )"""
```

(2) the item file:

```python
# -*- coding: utf-8 -*-
import scrapy

class ScrapyPro3Item(scrapy.Item):
    comment = scrapy.Field()
    time = scrapy.Field()
    name = scrapy.Field()
    password = scrapy.Field()
    is_active = scrapy.Field()
```

(3) the pipelines file:

```python
# -*- coding: utf-8 -*-
import pymysql
from twisted.enterprise import adbapi

class Scrapypro3Pipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # fixed name, called by scrapy; settings values are available directly
        adbparams = dict(
            host='localhost',
            db='mu_ke',
            user='root',
            password='root',
            cursorclass=pymysql.cursors.DictCursor  # cursor type
        )
        # ConnectionPool over pymysql (or MySQLdb)
        dbpool = adbapi.ConnectionPool('pymysql', **adbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the MySQL insert asynchronously through twisted's connection pool
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addCallback(self.handle_error)  # handle exceptions

    def do_insert(self, cursor, item):
        # no explicit commit needed; twisted commits automatically
        insert_sql = """
            insert into login_person(name, password, is_active, comment, time)
            values (%s, %s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item['name'], item['password'],
                                    item['is_active'], item['comment'], item['time']))

    def handle_error(self, failure):
        if failure:
            print(failure)
```

(4) the settings file:

```python
# -*- coding: utf-8 -*-
BOT_NAME = 'ScrapyPro3'
SPIDER_MODULES = ['ScrapyPro3.spiders']
NEWSPIDER_MODULE = 'ScrapyPro3.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'mu_ke'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'ScrapyPro3.pipelines.Scrapypro3Pipeline': 200,
}
# (the rest of the file is the unmodified scrapy template, all commented out)
```

(5) the start file that runs the spider:

```python
from scrapy import cmdline
cmdline.execute(["scrapy", "crawl", "tencent"])
```

Database creation:

```sql
create database mu_ke;
CREATE TABLE `login_person` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(100) DEFAULT NULL,
  `passsword` varchar(100) DEFAULT NULL,
  `is_active` varchar(100) DEFAULT NULL,
  `comment` varchar(100) DEFAULT NULL,
  `time` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1181 DEFAULT CHARSET=utf8;

select count(name) from login_person;  -- query returns 0 rows
```

After running, the query finds 0 rows — what is wrong here? (1) The run itself looks normal. (2) Environment: PyCharm 2019.3, Python 3.8, MySQL 8.0 (Workbench 8.0). (3) The database connection is fine.
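A few mismatches stand out (my reading of the code, not a confirmed diagnosis): the table defines the column as passsword (three s's) while the INSERT writes password, so every insert fails and the failure only surfaces through the error handler; item['time'] is a list (the result of findall), which pymysql cannot bind as a varchar; and failures from runInteraction arrive via an errback, not a callback. Sketches of the corresponding fixes:

```python
# 1) match the INSERT to the actual column spelling (or rename the column):
insert_sql = """
    insert into login_person(name, passsword, is_active, comment, time)
    values (%s, %s, %s, %s, %s)
"""
# 2) bind a scalar, not a list, for the time field:
time_value = item['time'][0] if item['time'] else ''
# 3) register an errback so insert failures actually surface:
query.addErrback(self.handle_error)
```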

requests returns an empty response

Student here, just playing around: scraping 微舆情 (wrd.cn). The headers and data have been updated and allow_redirects is set to False, but the value requests returns is empty. Code:

```python
import json
import requests
import datetime
import urllib3
from urllib3.exceptions import InsecureRequestWarning

urllib3.disable_warnings(InsecureRequestWarning)
sess = requests.session()

def run(keyword):
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'www.wrd.cn',
        'Origin': 'http://www.wrd.cn',
        'Referer': 'http://www.wrd.cn/goSearch.shtml',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }
    endTime = datetime.datetime.now()
    startTime = endTime + datetime.timedelta(days=-1)
    data = {
        'title': '%s' % (keyword),
        'keyword': '%s' % (keyword),
        'filterKeyword': '',
        'categoryId': '',
        'categoryType': '',
        'secondCategory': '',
        'date': '24',
        'categoryLevel': '',
        'startTime': startTime.strftime("%Y-%m-%d %H:%M:%S"),
        'endTime': endTime.strftime("%Y-%m-%d %H:%M:%S"),
        'secondClassifyName': '',
        'threeClassifyName': '',
        'isAll': '',
        'shareCode': ''
    }
    url = 'http://www.wrd.cn/view/openTools/goHotWorthOTChart.action'
    res = sess.post(headers=headers, data=data, url=url, allow_redirects=False)
    print(res.text)

run('千佛山')
```
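A quick check worth adding (a suggestion, not a confirmed cause): with allow_redirects=False, a 3xx response legitimately has an empty body, so inspecting the status code and Location header shows whether the server is bouncing the request somewhere else (a login or anti-bot page, say):

```python
res = sess.post(url, headers=headers, data=data, allow_redirects=False)
print(res.status_code)              # 200 and 302 tell two different stories
print(res.headers.get('Location'))  # set when the server redirects
```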

Python scraping program fails to run

```python
# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""
import requests as req
import pandas as pd
import matplotlib.pyplot as plt
import time
import re

url = 'http://wenshu.court.gov.cn/List/ListContent'
Index = 1
SleepNum = 3
dates = []
titles = []
nums = []

while (Index < 123):
    my_headers = {'User-Agent': 'User-Agent:Mozilla/5.0(Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.95Safari/537.36 Core/1.50.1280.400'}
    data = {'Param': '全文检索:执行', 'Index': Index, 'Page': '20', 'Order': '裁判日期', 'Direction': 'asc'}
    r = req.post(url, headers=my_headers, data=data)
    raw = r.json()
    pattern1 = re.compile('"裁判日期":"(.*?)"', re.S)
    date = re.findall(pattern1, raw)
    pattern2 = re.compile('"案号":"(.*?)"', re.S)
    num = re.findall(pattern2, raw)
    pattern3 = re.compile('"案件名称":"(.*?)"', re.S)
    title = re.findall(pattern3, raw)
    dates += date
    titles += title
    nums += num
    time.sleep(SleepNum)
    Index += 1

df = pd.DataFrame({'时间': dates, '案号': nums, '案件名称': titles})
df.to_excel('E:\result.xlsx')
```

Console output:

```
Python 2.7.11 |Anaconda 4.1.0 (64-bit)| (default, Jun 15 2016, 15:21:11) [MSC v.1500 64 bit (AMD64)]
...
runfile('C:/Users/xx/.spyder2/temp.py', wdir='C:/Users/xx/.spyder2')
Traceback (most recent call last):
  File "D:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
    execfile(filename, namespace)
  File "D:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/xx/.spyder2/temp.py", line 50, in <module>
    df.to_excel('E:\result.xlsx')
  File "D:\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1427, in to_excel
    excel_writer.save()
  File "D:\Anaconda2\lib\site-packages\pandas\io\excel.py", line 1444, in save
    return self.book.close()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\workbook.py", line 297, in close
    self._store_workbook()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\workbook.py", line 605, in _store_workbook
    xml_files = packager._create_package()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\packager.py", line 139, in _create_package
    self._write_shared_strings_file()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\packager.py", line 286, in _write_shared_strings_file
    sst._assemble_xml_file()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 53, in _assemble_xml_file
    self._write_sst_strings()
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 83, in _write_sst_strings
    self._write_si(string)
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 110, in _write_si
    self._xml_si_element(string, attributes)
  File "D:\Anaconda2\lib\site-packages\xlsxwriter\xmlwriter.py", line 122, in _xml_si_element
    self.fh.write("""%s""" % (attr, string))
  File "D:\Anaconda2\lib\codecs.py", line 706, in write
    return self.writer.write(data)
  File "D:\Anaconda2\lib\codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 7: ordinal not in range(128)
ERROR: execution aborted
```
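The traceback ends inside xlsxwriter's shared-strings writer, which on Python 2 tries to decode byte strings as ASCII. A sketch of the usual fix (an assumption based on the traceback, not verified against this data): hand pandas unicode literals, and avoid the '\r' escape hidden in 'E:\result.xlsx' by using a forward slash:

```python
# Python 2: unicode keys/values keep xlsxwriter from ASCII-decoding them
df = pd.DataFrame({u'时间': dates, u'案号': nums, u'案件名称': titles})
df.to_excel(u'E:/result.xlsx')
```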

Python crawler: status 200 but no data comes back

I want to scrape some information from 赢商网 (winshangdata.com); the information is present in the response when viewed in the browser. I tried requests.post, but I get no data back.

```python
import requests
import json

url = 'http://yzs.winshangdata.com/wsapi/brand/getBrandTuoZhanProvinces'
headers = {
    'Content-Type': 'application/json;charset=UTF-8',
    'Cookie': 'UM_distinctid=16f40179a3117f-08f154d235ffb8-6701b35-11442c-16f40179a322bd; Hm_lvt_f48055ef4cefec1b8213086004a7b78d=1577425731,1577425985,1577426204,1577427086; winfanguser=uid=shumiao888&nid=shumiao888_105711714&logNum=10467&err163=2dde6523ff7cdc29&pwd=06915a38d9e9e87f5ffe745c51d659&headerImg=http://user.winshangdata.com/image/default_20161129.png&sex=0&Email=&IsCompany=0; eyeuser=uid%3dshumiao888%26nid%3dshumiao888_105711714%26logNum%3d10467%26err163%3d2dde6523ff7cdc29%26pwd%3d06915a38d9e9e87f5ffe745c51d659%26headerImg%3dhttp%3a%2f%2fuser.winshangdata.com%2fimage%2fdefault_20161129.png%26sex%3d0%26Email%3d%26IsCompany%3d0; mode=mode; Hm_lpvt_f48055ef4cefec1b8213086004a7b78d=1577430750; JSESSIONID=C6734BEB6CAC2CAEB1DA02BD524B24AF; Hm_lvt_742e37d60ea288bb1d1f445eab6ce50b=1577365684,1577430757,1577430765,1577430801; Hm_lpvt_742e37d60ea288bb1d1f445eab6ce50b=1577430801',
    'platform': 'yzs',
    'Referer': 'http://yzs.winshangdata.com/',
    'Token': 'C334D3A3244400C2E53B431BCF0A6F17.B04B0D7A106EC8A6B9F5704BCE4A9CC9.2019-12-27 15:12:37',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
payload = {
    'brandId': "4680"
}

res = requests.post(url, headers, json=payload)

result = res.content.decode('utf-8')
```

The error looks like this: ![图片说明](https://img-ask.csdn.net/upload/201912/27/1577432213_96048.png)
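One bug is visible in the call itself (my observation, not from the thread): the signature is requests.post(url, data=None, json=None, **kwargs), so the headers dict passed as the second positional argument becomes the request body, and no headers are set at all. Naming the keyword fixes that:

```python
# headers must be passed by keyword; positionally it lands in `data`
res = requests.post(url, headers=headers, json=payload)
```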
