raid吾 2018-11-13 15:25 Acceptance rate: 0%
Views: 2494

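Python 3: scraping a web novel fails with both gbk and utf-8 encodings

I am scraping a single novel page, which should be simple, but I keep hitting encoding errors. Neither gbk nor utf-8 works.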

# -*- coding: utf-8 -*-

import urllib.request
from bs4 import BeautifulSoup

# Fetch the page at the given URL and return its HTML source as text.
def getHtml(weburl):
    webheaders = {
        'Referer': 'http://www.biqukan.cc/book/20461/12592815.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
    }
    req = urllib.request.Request(url=weburl, headers=webheaders)
    page = urllib.request.urlopen(req)
    html = page.read()

    return html.decode('gbk')

# Extract the chapter text from the HTML and append it to a file.
def gettext(html):
    soup = BeautifulSoup(html, "lxml")

    content = soup.find(class_='panel-body', id='htmlContent')
    txt = content.get_text()

    with open('D:\\test.txt', 'a') as f:
        f.write(txt)

weburl = "http://www.biqukan.cc/book/20461/12592815.html"
html = getHtml(weburl)  # fetch the HTML source of the page
gettext(html)
Error messages:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 75: illegal multibyte sequence

Also: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 116: invalid start byte
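
The two tracebacks above appear to come from different steps: the UnicodeDecodeError from decoding GBK bytes as UTF-8, and the UnicodeEncodeError from encoding text that contains '\xa0' (a non-breaking space, usually from `&nbsp;` in the HTML) back to GBK. The following is a minimal offline sketch that reproduces both; the sample string is an assumption standing in for the real page bytes.

```python
# Assumed sample text, encoded as GBK to stand in for page.read().
raw = "第一章 测试".encode("gbk")

# Step 1: GBK bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Step 2: '\xa0' has no GBK representation, so encoding text that
# contains it with the gbk codec fails.
text = raw.decode("gbk") + "\xa0"
try:
    text.encode("gbk")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(type(decode_error).__name__, type(encode_error).__name__)
```

If this reading is right, decoding the page as gbk is correct, and the remaining failure lies in how the text is written out.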

2 answers

  • lyhsdy 2018-11-14 02:02

    With the requests module, decode("gbk") gives text that is not garbled.

    import requests

    url = "http://www.biqukan.cc/book/20461/12592815.html"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.15 Safari/537.36',
    }
    # Take the raw response bytes and decode them explicitly as GBK.
    html = requests.get(url=url, headers=headers, verify=False).content.decode("gbk")
    print(html)
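
The answer above covers decoding only. The UnicodeEncodeError in the question most likely comes from the file write: `open()` without an `encoding` argument uses the locale's preferred encoding, which on a Chinese-language Windows system is typically GBK, and GBK cannot represent '\xa0'. Below is a sketch of one possible fix, assuming the page really is GBK. The helper names `decode_page` and `save_text` are hypothetical, and the sample bytes stand in for the downloaded page.

```python
import os
import tempfile


def decode_page(raw: bytes, encoding: str = "gbk") -> str:
    """Decode the raw response body (hypothetical helper; GBK is assumed)."""
    # errors="replace" avoids a crash on the odd malformed byte.
    return raw.decode(encoding, errors="replace")


def save_text(text: str, path: str) -> None:
    """Append text to path as UTF-8, regardless of the system locale."""
    # Swap non-breaking spaces for plain spaces so the file reads cleanly.
    cleaned = text.replace("\xa0", " ")
    with open(path, "a", encoding="utf-8") as f:
        f.write(cleaned)


# Assumed sample standing in for the page bytes; an HTML parser would
# turn &nbsp; into '\xa0', which is simulated here with str.replace.
raw = "第一章&nbsp;正文".encode("gbk")
text = decode_page(raw).replace("&nbsp;", "\xa0")

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "test.txt")
    save_text(text, out_path)
    with open(out_path, encoding="utf-8") as f:
        saved = f.read()

print(saved)
```

In the question's script, the equivalent change would be passing `encoding='utf-8'` to the `open('D:\\test.txt', 'a')` call. Replacing '\xa0' is optional once the file is written as UTF-8.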
    
    