soilblack2012 asked on 2013.11.16 17:50

Decoding with Python's BeautifulSoup module

Running the following code in IDLE produces a warning.
Code:

soup = BeautifulSoup(html.read().decode('utf-8','ignore'), "html")

The warning is:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

The official explanation is:
In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is not an exact representation of the original–some data was lost. If a document contains �, but .contains_replacement_characters is False, you’ll know that the � was there originally (as it is in this paragraph) and doesn’t stand in for missing data.

What should I do?
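
To make the quoted explanation concrete, here is a minimal sketch using only the standard library. It assumes, purely for illustration, that the page was saved as ISO-8859-1 rather than UTF-8; the sample bytes are made up, and the real page's encoding is not stated in the question. It shows that `'ignore'` silently drops bytes that are not valid UTF-8, while `'replace'` substitutes U+FFFD for them.

```python
# Hypothetical sample: "café" encoded as ISO-8859-1, which is not valid UTF-8.
raw = "<p>café</p>".encode("iso-8859-1")

# errors="ignore" silently drops the byte that cannot be decoded as UTF-8.
ignored = raw.decode("utf-8", "ignore")

# errors="replace" substitutes U+FFFD (REPLACEMENT CHARACTER) for that byte.
replaced = raw.decode("utf-8", "replace")

# Decoding with the encoding the bytes were actually written in loses nothing.
correct = raw.decode("iso-8859-1")

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Either way, forcing UTF-8 onto bytes written in another encoding loses data, so the fix is to find and use the document's real encoding.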

1 answer

mengzhendream   2016.10.23 15:29

BeautifulSoup(open(html_path, 'rb'), "html.parser", from_encoding="iso-8859-1")
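
A self-contained sketch of the same idea, for anyone who wants to try it without a file on disk. The sample bytes are hypothetical and stand in for the file contents, and ISO-8859-1 is only an assumption here: substitute whatever encoding the page really uses. The markup is passed as raw bytes because `from_encoding` only applies to input that has not been decoded yet.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Hypothetical sample bytes standing in for the file contents,
# encoded as ISO-8859-1 rather than UTF-8.
raw = "<p>café</p>".encode("iso-8859-1")

# Hand Beautiful Soup the undecoded bytes and name the encoding explicitly,
# so that it performs the decoding itself.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

text = soup.p.get_text()
print(text)

# False means no characters had to be swapped for U+FFFD during decoding.
print(soup.contains_replacement_characters)
```

If the encoding is not known in advance, leaving out `from_encoding` and passing bytes lets Beautiful Soup try to detect it; `soup.original_encoding` then reports what it settled on.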

u012923215: You are absolutely right. (more than a year ago)