Reading a Unicode text file from a URL?

Posted 2025-01-03 21:09:12, 562 characters, 6 views, 0 comments

I'm trying to use urllib and urllib2 to read from a text file that has French characters in it, like "é", "à", and so on.

def load(url):
     from urllib2 import Request, urlopen, URLError, HTTPError

     req = Request(url)

     f = urlopen(req)
     f.readline()

     for line in f:
          line = line.split('\t')
          word = line[0].encode('utf-8')

I have a feeling that the read() method returns a byte string, so I use encode('utf-8') to get the Unicode value, but this raises the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 6: ordinal not in range(128)

Can someone tell me what's going on? Any help would be appreciated. Thanks!

微凉徒眸意 2025-01-10 21:09:12

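Yes, you're reading bytes from the file. What you must do is decode, not encode, the byte string into Unicode. It's already encoded, you see. If it wasn't, you wouldn't need to do anything with it. In Python 2, calling encode() on a byte string first decodes it implicitly with the ASCII codec, which is why an encode call ends up raising a UnicodeDecodeError.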

word = unicode(line[0], "utf8")
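To show where the decode step fits in the loop from the question, here is a minimal sketch. It is written to run on Python 3, where `raw.decode(...)` plays the role of the `unicode(...)` call above. The function name `first_column_words` and the sample byte strings are made up for illustration, and the sample data is assumed to be UTF-8.

```python
def first_column_words(byte_lines, encoding="utf-8"):
    """Decode each tab-separated line and return its first field as text."""
    words = []
    for raw in byte_lines:
        # Decode bytes -> text first; only then strip and split on tabs.
        text = raw.decode(encoding).rstrip("\r\n")
        words.append(text.split("\t")[0])
    return words


# Hypothetical stand-in for the lines yielded by urlopen(); UTF-8 encoded.
sample = [b"caf\xc3\xa9\tnoun\n", b"d\xc3\xa9j\xc3\xa0\tadverb\n"]
words = first_column_words(sample)
```

Decoding the whole line before splitting keeps every later string operation on text rather than bytes, so no implicit ASCII conversion can sneak in.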

You have to specify the encoding used in the file. If it's not utf8, another good suspect might be latin1. Or, you know, since it's a Web document, you could fish the document's encoding out of the headers and/or its content, but that's a little beyond the scope of your question.
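The two ideas in that paragraph, trying latin1 when utf8 does not fit and reading the charset from the response headers, might be sketched as follows. This is Python 3 and only an illustration: `charset_from_content_type` and `decode_with_fallback` are hypothetical helper names, and the header value is passed in as a plain string rather than fetched from a live response.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


# Byte 0xe8 followed by an ASCII letter is not valid UTF-8, so latin-1 is used.
text, used = decode_with_fallback(b"tr\xe8s")
```

latin-1 maps every byte to a character and therefore never fails, so it only makes sense as the last candidate. A successful decode does not prove the guess was right, which is why a charset declared in the headers is the better source when the server provides one.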
