我查了一些关于python处理国际语言的文章,
感觉自己是用对了,可是结果还是不行
所以只能提问下了
我就是想抓下百度的首页,然后把页面上的链接和锚文本显示出来
现在唯一的问题就是在终端里打印出锚文本时,是乱码
我想转换,却报错
[code="java"]
#!/usr/local/bin/python
-*- coding: utf-8 -*-
import urllib2, htmllib, formatter
class LinksExtractor(htmllib.HTMLParser):
def __init__(self, formatter):
htmllib.HTMLParser.__init__(self, formatter)
self.links = []
self.archtexts = []
self.in_anchor = 0
def start_a(self, attrs):
# process the attributes
self.in_anchor = 1;
if len(attrs) > 0 :
for attr in attrs :
if attr[0] == "href" :
self.links.append(attr[1])
def end_a(self):
self.in_anchor = 0
def handle_data(self, text):
if self.in_anchor:
text = text.decode("GB2312")
self.archtexts.append(text)
def get_links(self) :
return self.links
print "你好"
#get html source
request = urllib2.Request('http://www.baidu.com/')
#request = urllib2.Request('http://localhost:8080/')
request.add_header('User-Agent', 'Mozilla/5.0')
opener = urllib2.build_opener()
htmlSource = opener.open(request).read()
format = formatter.NullFormatter()
htmlparser = LinksExtractor(format)
htmlparser.feed(htmlSource)
htmlparser.close()
links = htmlparser.get_links()
for i in range(len(htmlparser.links)):
temp = htmlparser.archtexts[i].encode("utf8")
print "url: %s, text: %s" % (htmlparser.links[i], temp)
#print links # print all the links
[/code]
报的错是:
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xa0 in position 0: incomplete multibyte sequence
我在代码中输出“你好”是可以在终端上正确显示的
百度的网页是gb2312
读百度网页时我 text = text.decode("GB2312"),转成unicode对象,
输出时再以utf-8编码输出到终端,就样就和“你好”一样了
我觉得应该是这样啊
不过怎么不对呢?
谢谢