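New to Python: scraping a Chinese web page runs into encoding problems, and no conversion I try works…

I have read several articles on how Python handles international text,
and I believe I am doing it correctly, but the result is still wrong,
so I have to ask here.

All I want is to fetch the Baidu home page and print the links on it together with their anchor text.
The only remaining problem is that the anchor text comes out garbled when it is printed in the terminal.

When I try to convert it, I get an error.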

[code="python"]
#!/usr/local/bin/python
# -*- coding: utf-8 -*-

import urllib2, htmllib, formatter

class LinksExtractor(htmllib.HTMLParser):

    def __init__(self, formatter):
        htmllib.HTMLParser.__init__(self, formatter)
        self.links = []
        self.archtexts = []
        self.in_anchor = 0

    def start_a(self, attrs):
        # process the attributes
        self.in_anchor = 1
        if len(attrs) > 0:
            for attr in attrs:
                if attr[0] == "href":
                    self.links.append(attr[1])

    def end_a(self):
        self.in_anchor = 0

    def handle_data(self, text):
        if self.in_anchor:
            text = text.decode("GB2312")
            self.archtexts.append(text)

    def get_links(self):
        return self.links

print "你好"

# get html source
request = urllib2.Request('http://www.baidu.com/')
#request = urllib2.Request('http://localhost:8080/')
request.add_header('User-Agent', 'Mozilla/5.0')
opener = urllib2.build_opener()
htmlSource = opener.open(request).read()

format = formatter.NullFormatter()

htmlparser = LinksExtractor(format)

htmlparser.feed(htmlSource)

htmlparser.close()

links = htmlparser.get_links()

for i in range(len(htmlparser.links)):
    temp = htmlparser.archtexts[i].encode("utf8")
    print "url: %s, text: %s" % (htmlparser.links[i], temp)

#print links # print all the links
[/code]

The error reported is:
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xa0 in position 0: incomplete multibyte sequence
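
One likely reading of this message, offered as an assumption rather than something the thread confirms: 0xa0 is the Latin-1 code for a non-breaking space, and `htmllib` appears to hand the replacement for `&nbsp;` to `handle_data` as a chunk of its own. A lone byte with the high bit set is only half of a GB2312 character, so decoding that chunk fails. The sketch below reproduces the failure in isolation. It uses Python 3 `bytes` syntax, whereas the code in this thread is Python 2.

```python
# GB2312 stores every non-ASCII character in two bytes, so a single
# high byte on its own is an incomplete sequence.
lone_byte = b"\xa0"

try:
    lone_byte.decode("gb2312")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Show the exception that was captured for the lone byte.
print(error)

# A complete two-byte sequence decodes without trouble.
full_char = "新".encode("gb2312")
print(len(full_char), full_char.decode("gb2312"))
```

The point of the comparison is that the codec itself is fine; what matters is whether each chunk passed to `decode` contains whole characters.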

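The "你好" that I print in the code is displayed correctly in the terminal.
The Baidu page is encoded in GB2312.
When reading the Baidu page I call text = text.decode("GB2312") to turn it into a unicode object,
and then encode it as UTF-8 when writing to the terminal, so it should behave just like "你好".
That is how I think it should work.
So why doesn't it?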
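
The conversion described here is sound when it is applied to a complete byte string. A minimal sketch, assuming Python 3 syntax (in the thread's Python 2, `str.decode` plays the role of `bytes.decode`) and a made-up sample string in place of the real page:

```python
# Hypothetical sample: bytes as they might arrive from a GB2312 page.
page_bytes = "百度一下，你就知道".encode("gb2312")

# Step 1: decode the raw bytes into a Unicode string.
text = page_bytes.decode("gb2312")

# Step 2: encode the Unicode string for a UTF-8 terminal.
utf8_bytes = text.encode("utf-8")

print(text)
print(len(page_bytes), len(utf8_bytes))
```

The same characters take a different number of bytes in each encoding, which is why bytes from a GB2312 page look garbled when a UTF-8 terminal interprets them directly.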

Thanks.

1 answer

phyeas 2009-04-30 12:50
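You need to sort out the encode and decode calls. The version below drops both the decode in handle_data and the encode before printing, so the page's raw bytes go straight to the console (presumably a Windows console whose code page is compatible with GB2312).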
[code="python"]
# -*- coding: utf-8 -*-

import urllib2, htmllib, formatter

class LinksExtractor(htmllib.HTMLParser):

    def __init__(self, formatter):
        htmllib.HTMLParser.__init__(self, formatter)
        self.links = []
        self.archtexts = []
        self.in_anchor = 0

    def start_a(self, attrs):
        # process the attributes
        self.in_anchor = 1
        if len(attrs) > 0:
            for attr in attrs:
                if attr[0] == "href":
                    self.links.append(attr[1])

    def end_a(self):
        self.in_anchor = 0

    def handle_data(self, text):
        if self.in_anchor:
            # keep the raw bytes as they are; no decode here
            self.archtexts.append(text)

    def get_links(self):
        return self.links

# get html source
request = urllib2.Request('http://www.baidu.com/')
#request = urllib2.Request('http://localhost:8080/')
request.add_header('User-Agent', 'Mozilla/5.0')
opener = urllib2.build_opener()
htmlSource = opener.open(request).read()

format = formatter.NullFormatter()

htmlparser = LinksExtractor(format)

htmlparser.feed(htmlSource)

htmlparser.close()

links = htmlparser.get_links()

for i in range(len(htmlparser.links)):
    # print the bytes unchanged; no encode here
    temp = htmlparser.archtexts[i]
    print "url: %s, text: %s" % (htmlparser.links[i], temp)
[/code]
Result:
E:\Program Files\Python25>python test2.py
url: http://passport.baidu.com/?login&tpl=mn, text: 登录
url: http://news.baidu.com, text: 新
url: http://tieba.baidu.com, text:
url: http://zhidao.baidu.com, text: 闻
url: http://mp3.baidu.com, text: 贴
url: http://image.baidu.com, text:
url: http://video.baidu.com, text: 吧
url: /gaoji/preferences.html, text: 知
url: /gaoji/advanced.html, text:
url: http://hi.baidu.com, text: 道
url: http://www.hao123.com, text: MP3
url: /more/, text: 图
url: http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com, text:
url: http://e.baidu.com, text: 片
url: http://top.baidu.com, text: 视
url: /home.html, text:
url: http://ir.baidu.com, text: 频
url: http://www.baidu.com/duty/, text: 设置
url: http://www.miibeian.gov.cn, text: 高级
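
In the output above the texts appear shifted against the URLs. `handle_data` can fire several times inside one `<a>` element, so `archtexts` ends up longer than `links`, and pairing the two lists by index goes wrong. One way to keep each URL with its own text is to decode the page once and accumulate the text per anchor. The sketch below does this with Python 3's `html.parser`, which is a swapped-in module and not the `htmllib` used in this thread (`htmllib` does not exist in Python 3). The class name `AnchorCollector` and the sample HTML are illustrative assumptions.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs, one pair per <a> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Only collect text while inside an <a> that has an href.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._chunks)))
            self._href = None


# Hypothetical GB2312-encoded fragment standing in for the downloaded page.
raw = '<a href="http://news.baidu.com">新&nbsp;闻</a>'.encode("gb2312")

parser = AnchorCollector()
# Decode the whole document up front; "replace" keeps going on bad bytes.
parser.feed(raw.decode("gb2312", errors="replace"))
parser.close()

for href, text in parser.pairs:
    print("url: %s, text: %s" % (href, text))
```

Because the bytes are decoded before parsing, the parser only ever sees Unicode text, and the `&nbsp;` entity becomes an ordinary character inside the joined anchor text.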
This answer was accepted by the asker as the best answer.
