!@#$%^& 2013-11-16 09:50 Acceptance rate: 50%

Decoding with the Python BeautifulSoup module

Running the following code in IDLE produces a warning.
Code:

soup = BeautifulSoup(html.read().decode('utf-8','ignore'), "html")

The warning is:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

The official documentation explains it as follows:
In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is not an exact representation of the original–some data was lost. If a document contains �, but .contains_replacement_characters is False, you’ll know that the � was there originally (as it is in this paragraph) and doesn’t stand in for missing data.

What should I do?
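
For context, here is a minimal standard-library sketch of what the warning is describing. The sample bytes are hypothetical (a short GBK-encoded string standing in for the real page, whose actual encoding is not stated in the question). It compares decoding non-UTF-8 bytes as UTF-8 with `errors="replace"`, with `errors="ignore"` (as in the code above), and with the correct codec.

```python
# Hypothetical sample: two Chinese characters encoded as GBK, not UTF-8.
raw = "编码".encode("gbk")

# Decoding as UTF-8 with "replace" substitutes U+FFFD for undecodable bytes.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with "ignore" silently drops those bytes instead, so data is
# still lost even though no replacement character appears.
ignored = raw.decode("utf-8", errors="ignore")

# Decoding with the encoding the bytes were really written in loses nothing.
correct = raw.decode("gbk")

has_replacement = "\ufffd" in replaced
print("replacement characters present:", has_replacement)
print("text survived 'ignore':", ignored == correct)
print("correctly decoded text:", correct)
```

If the page really is in another encoding, neither `'ignore'` nor `'replace'` recovers the text; the likely fix is to find the real encoding and decode with that.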


1 answer

  • mengzhendream 2016-10-23 07:29

    BeautifulSoup(open(html_path, 'rb'), "html.parser", from_encoding="iso-8859-1")
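
The idea in this answer can be sketched as below, under some assumptions: the markup is a hypothetical in-memory GBK-encoded snippet rather than a file at `html_path`, and `"gbk"` stands in for whatever encoding the real page uses. The important part is that BeautifulSoup receives raw bytes, because `from_encoding` is ignored when the markup is already a `str`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real encoding of the asker's page is unknown.
raw = "<html><body><p>中文内容</p></body></html>".encode("gbk")

# Pass bytes, not an already-decoded str, so that from_encoding is honoured.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

# Inspect what BeautifulSoup decided and whether any data was replaced.
print("encoding used:", soup.original_encoding)
print("replacement characters:", soup.contains_replacement_characters)

text = soup.p.get_text()
print(text)
```

One caveat on `iso-8859-1`: that codec maps every possible byte to some character, so it will make the warning go away even when it is the wrong encoding. It is worth checking that the parsed text actually reads correctly, not just that the warning is gone.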
