isMagic · 2017-11-30 19:21 (answer accepted)

Scrapy returns garbled text when crawling the Zhihu homepage

When I crawl the Zhihu homepage, response.text comes back garbled. I also tried decoding response.body by hand with several encodings and still got garbage, and I can't figure out why. Code below:

import scrapy


HEADERS = {
    'Host': 'www.zhihu.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Origin': 'https://www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}


class ZhihuSpider(scrapy.Spider):

    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=HEADERS)

    def parse(self, response):
        print('========== parse ==========')
        print(response.text[:100])

        # Try decoding the raw body by hand with a few common encodings.
        body = response.body
        encodings = ['utf-8', 'gbk', 'gb2312', 'iso-8859-1', 'latin1']
        for encoding in encodings:
            try:
                print('========== decode ' + encoding)
                print(body.decode(encoding)[:100])
                print('========== decode end\n')
            except Exception as e:
                print('########## decode {0}, error: {1}\n'.format(encoding, e))




The log output is as follows:
D:\workspace_python\ZhihuSpider>scrapy crawl zhihu
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ZhihuSpider)
2017-12-01 11:12:03 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ZhihuSpider', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'ZhihuSpider.spiders', 'SPIDER_MODULES': ['ZhihuSpider.spiders']}
2017-12-01 11:12:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-01 11:12:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-12-01 11:12:04 [scrapy.core.engine] INFO: Spider opened
2017-12-01 11:12:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-01 11:12:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-01 11:12:04 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://www.zhihu.com/)
========== parse ==========
��~!���#5���=B���_��^��ˆ� ═4�� 1���J�╗%Xi��/{�vH�"��
z�I�zLgü^�1�
Q)Ա�_k}�䄍���/T����U�3���l���
========== decode utf-8
########## decode utf-8, error: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte

========== decode gbk
########## decode gbk, error: 'gbk' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence

========== decode gb2312
########## decode gb2312, error: 'gb2312' codec can't decode byte 0xa2 in position 4: illegal multibyte sequence

========== decode iso-8859-1
áø~!¢

With the exact same code, crawling douban works without any problem at all. I've searched Baidu everywhere and found nothing, so I'm asking here. Please help: if I can't get this crawler working, my Zhihu-clone backend will have no data to show, and I'm really anxious. I have fewer than 5 C币 left so I can't post a bounty, but I genuinely need your help.

2 answers (only the accepted answer is shown below)

  • 脱裤儿任风吹 · 2017-11-30 21:34
    HEADERS = {
        'Host': 'www.zhihu.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Origin': 'https://www.zhihu.com',
        'Referer': 'https://www.zhihu.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
    
    

    This HEADERS dict is what caused the problem, almost certainly the `'Accept-Encoding': 'gzip, deflate, br'` line: it tells the server the client accepts Brotli (`br`), but Scrapy 1.4's HttpCompressionMiddleware only decompresses gzip and deflate, so the Brotli-compressed body reaches the spider still compressed and looks like random bytes. Keep only the following and it works normally:

    HEADERS = {
        'Host': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
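
    A quick way to confirm the diagnosis (a sketch, not part of the original answer; it assumes the third-party `brotli` package, installed with `pip install brotli`) is to check the response's `Content-Encoding` header and decompress the body by hand:

    import brotli  # third-party: pip install brotli

    def parse(self, response):
        # Scrapy 1.4 leaves an encoding it doesn't understand untouched,
        # so the header still says 'br' and the body is still compressed.
        content_encoding = response.headers.get('Content-Encoding', b'').decode()
        print('Content-Encoding:', content_encoding)
        if content_encoding == 'br':
            html = brotli.decompress(response.body).decode('utf-8')
            print(html[:100])  # readable HTML now

    Newer Scrapy releases can decompress Brotli responses on their own when a brotli library is installed, so on a current version the full browser headers would also work.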
    

    Also note that the Zhihu homepage requires login; you can refer to my blog post: http://blog.csdn.net/lifeifei1245/article/details/73076437
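
    For a quick logged-in session, one common approach (a sketch with placeholder cookie names, not taken from the blog post) is to copy the cookies from a browser where you are already signed in and attach them to each request:

    # Placeholder key/value -- copy the real cookies from your browser's
    # developer tools while signed in to zhihu.com.
    COOKIES = {
        'session_cookie_name': 'session_cookie_value',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=HEADERS, cookies=COOKIES)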

    This answer was accepted by the asker as the best answer.
    吃鱼爱挑刺 · 2022-06-14 11:08
    Great answer! I had copied all of the browser's request headers over, and that was exactly what garbled everything.