Reading a Unicode text file from a URL?

Posted on 2025-01-03 21:09:12

I'm trying to use urllib and urllib2 to read from a text file that has French characters in it, like "é", "à", and so on.

def load(url):
     from urllib2 import Request, urlopen, URLError, HTTPError

     req = Request(url)

     f = urlopen(req)
     f.readline()

     for line in f:
          line = line.split('\t')
          word = line[0].encode('utf-8')

I have a feeling that the read() method returns a byte string, so I use encode('utf-8') to get the Unicode value, but this gives me the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)

Can someone tell me what's going on? Any help would be appreciated. Thanks!

Comments (2)

微凉徒眸意 2025-01-10 21:09:12

Yes, you're reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It's already encoded, you see. If it wasn't, you wouldn't need to do anything with it.

word = unicode(line[0], "utf8")
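
Here is a small sketch of the decode step on its own, with no network access. The sample bytes and the helper name `first_column` are made up for illustration; the tab-separated layout follows the question's code. In Python 2, calling `encode` on a byte string first decodes it implicitly as ASCII, which is where the `'ascii' codec can't decode` message comes from; the last few lines reproduce that failure explicitly.

```python
def first_column(raw_line, encoding="utf-8"):
    """Return the first tab-separated field of a byte line as Unicode text."""
    field = raw_line.split(b"\t")[0]
    return field.decode(encoding)

# Hypothetical sample line: the word "élève" encoded as UTF-8, a tab, a count.
sample = b"\xc3\xa9l\xc3\xa8ve\t42\n"
word = first_column(sample)

# Decoding as ASCII fails on any byte >= 0x80, which is what happens
# behind the scenes when Python 2 is asked to encode a byte string.
try:
    sample.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Calling `.decode(encoding)` on the bytes is equivalent to `unicode(raw, encoding)` in Python 2 and also works unchanged in Python 3.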

You have to specify the encoding used in the file. If it's not utf8, another good suspect might be latin1. Or, you know, since it's a Web document, you could fish the document's encoding out of the headers and/or its content, but that's a little beyond the scope of your question.
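
As a rough sketch of that last idea: the charset can be read from the `Content-Type` header, with a fallback chain when the header says nothing. The helper names `charset_from_content_type` and `decode_with_fallback` are hypothetical, and the choice of utf-8 followed by latin-1 as fallbacks is an assumption based on the suggestion above. Byte 0xe8 from the error message happens to be "è" in latin1, which hints that the file may be latin1, though that is only a guess.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the lowercased charset, or `default` if none is declared.
    return msg.get_content_charset(default)

def decode_with_fallback(raw, declared=None, fallbacks=("utf-8", "latin-1")):
    """Try the declared charset first, then each fallback in order."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: move on to the next candidate.
            continue
    raise ValueError("no candidate encoding could decode the data")
```

With urllib2, the header value should be available through something like `f.info().getheader('Content-Type')`; treat that call as an assumption to check against your Python version. Note that latin-1 accepts every byte sequence, so it only makes sense as the last fallback.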

怎会甘心 2025-01-10 21:09:12

Put the line below at the top of your file.

# coding: utf-8

Actually, supporting Unicode is not easy in Python.
I also recommend these articles.

http://www.python.org/dev/peps/pep-0263

http://groups.google.com/group/python-excel/browse_thread/thread/100ec019d3a2a1a9
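
For context on the declaration above: as described in PEP 263, a `# coding:` line tells Python how to decode the bytes of the source file itself, which matters for non-ASCII literals typed into the script. The sketch below compiles two tiny made-up source snippets to show that effect. The helper `run_source` and both snippets are hypothetical, and the example says nothing about data read at run time from a file or URL, which still needs an explicit decode.

```python
def run_source(source_bytes):
    """Compile and run a tiny module given as bytes; return its `word` variable."""
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["word"]

# Two hypothetical one-line modules that store the letter "è" differently on disk:
# one byte in latin-1, two bytes in UTF-8.
latin1_source = b"# coding: latin-1\nword = u'\xe8'\n"
utf8_source = b"# coding: utf-8\nword = u'\xc3\xa8'\n"

latin1_word = run_source(latin1_source)
utf8_word = run_source(utf8_source)

# Both declarations lead to the same Unicode text once the source is decoded.
same_text = latin1_word == utf8_word
```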
