爬取知乎首页,返回的response.text是乱码,尝试解码response.body,得到的还是乱码,不知道为什么,代码如下:
import scrapy
HEADERS = {
'Host': 'www.zhihu.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'Origin': 'https://www.zhihu.com',
'Referer': 'https://www.zhihu.com/',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}
class ZhihuSpider(scrapy.Spider):
name = 'zhihu'
allowed_domains = ['www.zhihu.com']
start_urls = ['https://www.zhihu.com/']
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(url, headers=HEADERS)
def parse(self, response):
print('========== parse ==========')
print(response.text[:100])
body = response.body
encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1']
for encoding in encodings:
try:
print('========== decode ' + encoding)
print(body.decode(encoding)[:100])
print('========== decode end\n')
except Exception as e:
print('########## decode {0}, error: {1}\n'.format(encoding, e))
pass
输出的log如下:
D:\workspace_python\ZhihuSpider>scrapy crawl zhihu
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider)
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']}
2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened
2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://www.zhihu.com/)
========== parse ==========
��~!���#5���=B���_��^��ˆ� ═4�� 1���J�╗%Xi��/{�vH�"��
z�I�zLgü^�1�
Q)Ա�_k}�䄍���/T����U�3���l���
========== decode utf-8
########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte
========== decode gbk
########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode gb2312
########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence
========== decode iso-8859-1
áø~!¢
同样的代码,如果将爬取的网站换成douban,就一点问题都没有,百度找遍了都没找到办法,只能来这里提问了,请各位大神帮帮忙,如果爬虫搞不定,我仿的知乎后台就没数据展示了,真的很着急,。剩下不到5C币,没法悬赏,但真的需要大神的帮助。