Reading a Unicode text file from a URL?
I'm trying to use urllib and urllib2 to read a text file that contains French characters such as "é" and "à".
    def load(url):
        from urllib2 import Request, urlopen, URLError, HTTPError
        req = Request(url)
        f = urlopen(req)
        f.readline()
        for line in f:
            line = line.split('\t')
            word = line[0].encode('utf-8')
I have a feeling that the read() method returns a byte string, so I call encode('utf-8') to get the Unicode value, but this gives me the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)
Can someone tell me what's going on? Any help would be appreciated. Thanks!
Comments (2)
Yes, you're reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It's already encoded, you see. If it weren't, you wouldn't need to do anything with it.
You have to specify the encoding used in the file. If it's not utf8, another good suspect might be latin1. Or, since it's a Web document, you could fish the document's encoding out of the headers and/or its content, but that's a little beyond the scope of your question.
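To make the decode-versus-encode point concrete: in Python 2, calling encode() on a byte string first decodes it implicitly with the ascii codec, which is why the traceback reports a UnicodeDecodeError. Below is a minimal sketch of the decode-first approach. It assumes Python 3, where urllib2 became urllib.request. The names first_columns, load and fallback_encoding are made up for illustration, and latin-1 is only a guess suggested by byte 0xe8, which is "è" in that encoding.

```python
from urllib.request import urlopen


def first_columns(byte_lines, encoding="utf-8", skip_header=True):
    """Yield the first tab-separated field of each line as decoded text."""
    lines = iter(byte_lines)
    if skip_header:
        next(lines, None)  # mirrors the f.readline() call in the question
    for raw in lines:
        # Decode bytes into text first, then split the text.
        text = raw.decode(encoding).rstrip("\r\n")
        yield text.split("\t")[0]


def load(url, fallback_encoding="latin-1"):
    """Fetch url and return the first column of every line after the header."""
    with urlopen(url) as response:
        # Charset from the Content-Type header, or None if the server sent none.
        charset = response.headers.get_content_charset()
        return list(first_columns(response, charset or fallback_encoding))


# Usage without a network connection: two latin-1 encoded rows after a header.
sample = [b"word\tcount\n", b"cr\xe8me\t5\n", b"caf\xe9\t3\n"]
print(list(first_columns(sample, "latin-1")))  # ['crème', 'café']
```

Keeping the decoding in a helper that accepts any iterable of byte lines means it can be checked without touching the network. If the server declares no charset and the fallback guess is wrong, decode() will either raise or produce garbled text, so the encoding still has to be confirmed for the actual file.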
Put the code below at the top.
Supporting Unicode is actually not easy in Python.
I also recommend these links:
http://www.python.org/dev/peps/pep-0263
http://groups.google.com/group/python-excel/browse_thread/thread/100ec019d3a2a1a9