Decoding problem with urllib2 in Python
I'm trying to use urllib2 in Python 2.7 to fetch a page from the web. The page is encoded in UTF-8 and contains Greek characters. When I fetch and print it with the code below, I get gibberish instead of the Greek characters.
import urllib2
print urllib2.urlopen("http://www.pamestihima.gr").read()
The result is the same both in Netbeans 6.9.1 and in Windows 7 CLI.
I'm doing something wrong, but what?
Comments (2)
Unicode is not UTF-8. UTF-8 is a character encoding, like ISO-8859-1 or ASCII.
Always decode your data as soon as possible, to make real Unicode out of it. For example:
'somestring in utf-8'.decode('utf-8') == u'somestring in utf-8'
Unicode objects are written u'', not ''.
When data leaves your app, always encode it in the proper encoding. For Web stuff this is mostly utf-8. For console stuff it is whatever your console encoding is. On Windows this is not UTF-8 by default.

It prints correctly for me, too.
Check the character encoding of the program in which you are viewing the HTML source code. For example, in a Linux terminal, you can find "Set Character Encoding" and make sure it is UTF-8.
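
Putting the two comments together, here is a minimal sketch of the "decode early, encode late" idea. The helper name to_console_bytes, its parameters, and the sample word are hypothetical and not from the original posts. The sample bytes stand in for what urllib2.urlopen(...).read() would return, so the example needs no network access.

```python
import sys


def to_console_bytes(raw, source_encoding="utf-8", console_encoding=None):
    """Decode raw bytes to Unicode, then re-encode them for the console.

    Hypothetical helper for illustration; the names are assumptions.
    """
    # Decode as early as possible: bytes -> Unicode text.
    text = raw.decode(source_encoding)
    # Fall back to UTF-8 when the console encoding cannot be detected.
    target = console_encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes "?" for characters the console cannot represent.
    return text.encode(target, "replace")


# Stand-in for the page body: the UTF-8 bytes of a Greek word.
greeting = u"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"
raw = greeting.encode("utf-8")

# A Greek-capable encoding keeps every character.
greek_console = to_console_bytes(raw, console_encoding="iso-8859-7")
# An ASCII-only console cannot represent them, so each becomes "?".
ascii_console = to_console_bytes(raw, console_encoding="ascii")
```

In Python 2.7 the returned bytes could be passed to sys.stdout.write. If the console encoding has no Greek letters, question marks are the best this approach can do, and switching the console to a Greek-capable or UTF-8 code page would still be needed to see the real text.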