I've recently been learning web scraping by downloading a web novel, and I ran into a page containing a garbled character. The page is gb2312-encoded; I tried decoding it with gb2312, gbk, and utf-8, and none of them can handle that character. Since I'm grabbing the text a whole page at a time, a single decode error means an entire chapter fails to download, which is really annoying.
I'd like to ask everyone: is there a way to simply ignore the character that won't decode and write out the rest of the extracted content?
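To make it concrete, this is the kind of "just skip it" behaviour I'm hoping for. A tiny standalone sketch with made-up bytes, using plain bytes.decode (I don't know whether aiohttp exposes anything equivalent):

# made-up example: the trailing 0xff byte stands in for the character gbk chokes on
raw = '第一章 你好'.encode('gbk') + b'\xff'
# raw.decode('gbk')                          # this is roughly what blows up for me now
clean = raw.decode('gbk', errors='ignore')   # silently drops whatever gbk can't decode
print(clean)                                 # -> 第一章 你好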
My download code is below:
# download one chapter
import aiohttp
import bs4

# semaphore is an asyncio.Semaphore defined earlier in my script
async def download(url, name):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as reques:
                # the site declares gb2312, so decode the body as gbk;
                # this is the line that errors out on the bad character
                page = bs4.BeautifulSoup(await reques.text(encoding='gbk'), 'html.parser')
                div = page.find('div', class_="read_chapterDetail")
                p = div.find_all('p')
                # open the output file in binary mode and write each paragraph as utf-8
                with open(f'{name}.txt', mode='wb') as f:
                    for i in p:
                        text = i.text + '\n'
                        f.write(text.encode('utf-8'))
                print(f'{name} finished downloading!')
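If it helps, here is where I imagine the change would go inside download(): read the raw bytes myself and decode them with errors='ignore' instead of letting text() do the decoding. This is just an untested guess on my part, not something I've confirmed works:

# untested idea, meant to replace the text() call inside download()
raw = await reques.read()                      # raw bytes, no decoding yet
html = raw.decode('gbk', errors='ignore')      # skip whatever gbk can't decode ('replace' would keep a placeholder)
page = bs4.BeautifulSoup(html, 'html.parser')

Would this work, or is there a cleaner way to do it?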