weixin_43408134
raid吾
Acceptance rate: 0%
2018-11-13 15:25 · 2.4k views

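python3: scraping a web novel fails with encoding errors for both gbk and utf-8

I am scraping a single novel chapter with a very simple script, but I keep hitting encoding errors. Neither gbk nor utf-8 works.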

# -*- coding: utf-8 -*-

import urllib.request
import re
import sys
import os
import urllib
from bs4 import BeautifulSoup
from urllib import request

# Fetch the page at the given URL; the returned html is the page's source code.

def getHtml(weburl):
    webheaders = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'}
    # Note: this second assignment overwrites the headers above.
    webheaders={
        'Referer':'http://www.biqukan.cc/book/20461/12592815.html',
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
    }
    req = urllib.request.Request(url=weburl, headers=webheaders)
    page = urllib.request.urlopen(req)
    html = page.read()

    return html.decode('gbk')

def gettext(html):
    soup = BeautifulSoup(html, "lxml")

    content = soup.find(class_='panel-body',id='htmlContent')
    txt=content.get_text()

    # No encoding is given here, so the platform default codec is used for writing.
    with open('D:\\test.txt','a') as f:
        f.write(txt)

weburl="http://www.biqukan.cc/book/20461/12592815.html"
html=getHtml(weburl)  # Fetch the page; html holds its source code
gettext(html)
Error messages:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 75: illegal multibyte sequence

And also: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 116: invalid start byte
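
For anyone hitting the same two messages: both can be reproduced without any network access. The sketch below is not from the original post. It assumes the page is served as GBK bytes and that the extracted text contains a no-break space (U+00A0, which usually comes from `&nbsp;` in the HTML). The sample bytes and the `attempt` helper are illustrative names only. Under those assumptions, the UnicodeDecodeError appears when the GBK bytes are decoded as utf-8, and the UnicodeEncodeError most likely appears when the text is written through a gbk codec, for example by `open()` without an `encoding` argument on a Chinese-locale Windows machine.

```python
# Hypothetical sample standing in for page.read(): the GBK bytes of U+7684.
raw = b"\xb5\xc4"


def attempt(action):
    """Run action() and return its result, or the name of the Unicode error it raised."""
    try:
        return action()
    except UnicodeError as exc:
        return type(exc).__name__


# Decoding GBK bytes as UTF-8 fails: 0xb5 cannot start a UTF-8 sequence.
decode_as_utf8 = attempt(lambda: raw.decode("utf-8"))

# Decoding the same bytes as GBK succeeds.
decode_as_gbk = attempt(lambda: raw.decode("gbk"))

# Encoding a no-break space to GBK fails: GBK has no mapping for U+00A0.
encode_to_gbk = attempt(lambda: "\xa0".encode("gbk"))

# Encoding it to UTF-8 succeeds.
encode_to_utf8 = attempt(lambda: "\xa0".encode("utf-8"))
```

So the two errors point at two different steps: one at reading (wrong codec for the bytes), the other at writing (a codec that cannot represent a character in the text).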

2 answers

  • weixin_39416561 lyhsdy 2018-11-14 02:02

    With the requests module and decode("gbk"), the text comes out without garbling.

    import requests
    url = "http://www.biqukan.cc/book/20461/12592815.html"
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.15 Safari/537.36',
    }
    html=requests.get(url=url,headers=headers,verify=False).content.decode("gbk")
    print(html)
    
    
  • weixin_43408134 raid吾 2018-11-14 14:39

    Thanks. My fix was to replace the 0xb5 that could not be decoded with a space.

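
Both replies come down to choosing the right codec on the way in and on the way out. A minimal sketch of that idea follows, with several assumptions: the helper names (`declared_charset`, `decode_html`, `save_text`) are hypothetical, the sample bytes and the Content-Type value stand in for a real response, and `gb18030` is used as the fallback because it is a superset of gbk. `html.unescape` is used only to mimic how an HTML parser turns `&nbsp;` into U+00A0.

```python
import os
import tempfile
from email.message import Message
from html import unescape


def declared_charset(content_type, default="gb18030"):
    """Return the charset named in a Content-Type header value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_html(raw, content_type):
    """Decode page bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(declared_charset(content_type), errors="replace")


def save_text(text, path):
    """Append text as UTF-8 so the platform default codec is never involved."""
    cleaned = text.replace("\xa0", " ")  # optional: no-break space -> plain space
    with open(path, "a", encoding="utf-8") as f:
        f.write(cleaned)


# Hypothetical stand-ins for page.read() and the response's Content-Type header.
raw = "第一章&nbsp;正文".encode("gbk")
text = unescape(decode_html(raw, "text/html; charset=gbk"))

# A temporary file stands in for D:\test.txt from the question.
out_path = os.path.join(tempfile.mkdtemp(), "test.txt")
save_text(text, out_path)
```

With urllib, the header-based lookup is available directly as `page.headers.get_content_charset()`; with requests, `response.encoding` plays a similar role. Whether this particular site declares its charset in the header is not known from the thread, which is why a fallback is kept. Passing `encoding="utf-8"` to `open()` is the part that avoids the UnicodeEncodeError, and replacing `\xa0` is then a matter of taste rather than a requirement.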
