isMagic 2017-12-01 03:21 采纳率: 100%
浏览 4440
已采纳

scrapy爬取知乎首页乱码

爬取知乎首页,返回的response.text是乱码,尝试解码response.body,得到的还是乱码,不知道为什么,代码如下:

import scrapy


HEADERS = {
    'Host': 'www.zhihu.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Origin': 'https://www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}


class ZhihuSpider(scrapy.Spider):

    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=HEADERS)

    def parse(self, response):
        print('========== parse ==========')
        print(response.text[:100])

        body = response.body
        encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1']
        for encoding in encodings:
            try:
                print('========== decode ' + encoding)
                print(body.decode(encoding)[:100])
                print('========== decode end\n')
            except Exception as e:
                print('########## decode {0}, error: {1}\n'.format(encoding, e))
                pass




输出的log如下:
D:\workspace_python\ZhihuSpider>scrapy crawl zhihu
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider)
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']}
2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened
2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://www.zhihu.com/)
========== parse ==========
��~!���#5���=B���_��^��ˆ� ═4�� 1���J�╗%Xi��/{�vH�"��
z�I�zLgü^�1�
Q)Ա�_k}�䄍���/T����U�3���l���
========== decode utf-8
########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte

========== decode gbk
########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence

========== decode gb2312
########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence

========== decode iso-8859-1
áø~!¢

同样的代码,如果将爬取的网站换成douban,就一点问题都没有,百度找遍了都没找到办法,只能来这里提问了,请各位大神帮帮忙,如果爬虫搞不定,我仿的知乎后台就没数据展示了,真的很着急,。剩下不到5C币,没法悬赏,但真的需要大神的帮助。

  • 写回答

2条回答 默认 最新

  • 脱裤儿任风吹 2017-12-01 05:34
    关注
    HEADERS = {
        'Host': 'www.zhihu.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Origin': 'https://www.zhihu.com',
        'Referer': 'https://www.zhihu.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
    
    

    这个出的问题,保留到如下即可正常

    HEADERS = {
        'Host': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
    

    另外知乎首页是需要登录的,可以参考我的博文 http://blog.csdn.net/lifeifei1245/article/details/73076437

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥20 有关区间dp的问题求解
  • ¥15 多电路系统共用电源的串扰问题
  • ¥15 slam rangenet++配置
  • ¥15 有没有研究水声通信方面的帮我改俩matlab代码
  • ¥15 对于相关问题的求解与代码
  • ¥15 ubuntu子系统密码忘记
  • ¥15 信号傅里叶变换在matlab上遇到的小问题请求帮助
  • ¥15 保护模式-系统加载-段寄存器
  • ¥15 电脑桌面设定一个区域禁止鼠标操作
  • ¥15 求NPF226060磁芯的详细资料