How do I print a list of strings when I cannot know the character encoding in advance?

Posted on 2024-09-18 02:04:45

I am retrieving a list of names from a webservice using a client I've written in Python. Upon retrieving the list, I encode each name to unicode and then print each of them to stdout. When I get to the name "Ólafur Jóhann Ólafsson", I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: 
                    ordinal not in range(128)

Since I cannot know what the encoding will be, how do I convert all of these strings to unicode? Or can you suggest a better way to handle this problem?

Comments (3)

酷遇一生 2024-09-25 02:04:45

First of all, you decode data to Unicode (the absence of encoding) when reading from a file, pipe, socket, terminal, etc.; and encode Unicode to an appropriate byte encoding when sending/persisting data. I suspect this is the root of your problem.
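
The decode-on-input, encode-on-output rule can be sketched as follows. This is a minimal illustration in Python 3 syntax (the sessions in this answer are Python 2, where the text type is `unicode`). The byte string `raw` and the assumption that the sender used UTF-8 are made up for the example.

```python
# Hypothetical bytes as they might arrive from a socket or HTTP response.
# Assumption: the sender used UTF-8.
raw = b"\xc3\x93lafur J\xc3\xb3hann \xc3\x93lafsson"

# Input boundary: decode bytes to text exactly once.
name = raw.decode("utf-8")

# Inside the program, work only with text (one item per character).
initials = "".join(word[0] for word in name.split())

# Output boundary: encode text back to bytes for the destination.
outgoing = name.encode("utf-8")
```

Note that `raw` starts with byte 0xc3, which is consistent with the byte reported in the question's error message.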

The web service should declare the encoding in the headers or data received. print normally encodes Unicode to the terminal's encoding automatically (discovered through sys.stdout.encoding) or, in the absence of that, just ascii. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError.
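
As a rough sketch of using a declared encoding, the snippet below reads the `charset` parameter from a Content-Type header with the standard library's `email.message.Message` and decodes the body with it. It is written in Python 3 syntax; the header value, the body bytes, and the UTF-8 fallback are all assumptions for illustration, not something the web service in the question is known to send.

```python
from email.message import Message

# Hypothetical header value and body; a real client would read both
# from the HTTP response it received.
content_type = "text/plain; charset=ISO-8859-1"
body = b"\xd3lafur J\xf3hann \xd3lafsson"

# The standard library's MIME header parser extracts the charset parameter.
header = Message()
header["Content-Type"] = content_type
declared = header.get_content_charset()  # None if no charset is declared

# Assumption for this sketch: fall back to UTF-8 when nothing is declared.
text = body.decode(declared or "utf-8")
```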

Since that is not the error you received, you should post some code so we can see what you are doing. Most likely, you are encoding a byte string instead of decoding it. Here's an example of this:

>>> data = '\xc2\xbd' # UTF-8 encoded 1/2 symbol.
>>> data.encode('cp437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

What I did here is call encode on a byte string. Since encode requires a Unicode string, Python used the default ascii encoding to decode the byte string to Unicode first, before encoding to cp437.
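
The hidden step can be made explicit. The sketch below, in Python 3 syntax, performs by hand the ASCII decode that Python 2 does implicitly before encoding a byte string. In Python 3 the mistake cannot happen silently, because `bytes` objects have no `encode` method at all.

```python
data = b"\xc2\xbd"  # UTF-8 encoded 1/2 symbol, as in the session above

# Python 3 byte strings cannot be encoded, only decoded.
bytes_have_encode = hasattr(data, "encode")

# This is the step Python 2 performed implicitly before encoding:
try:
    data.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
half = data.decode("utf-8")
```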

Fix this by decoding the data instead of encoding it; print will then encode to stdout automatically. As long as your terminal supports the characters in the data, it will display properly:

>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print data.decode('utf8') # implicit encode to sys.stdout.encoding
½
>>> print data.decode('utf8').encode('cp437') # explicit encode.
½
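
When the terminal's encoding does not support a character, the encode step fails. The sketch below (Python 3 syntax) shows this for cp437, which has no capital "Ó", along with two standard error handlers that avoid the exception. Which handler is appropriate is a judgment call and is assumed here only for illustration.

```python
name = "Ólafur Jóhann Ólafsson"

# Strict encoding raises when the target code page lacks a character.
try:
    name.encode("cp437")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="replace" substitutes "?" for unsupported characters.
replaced = name.encode("cp437", errors="replace")

# errors="backslashreplace" keeps the information as an escape sequence.
escaped = name.encode("cp437", errors="backslashreplace")
```

The lowercase "ó" survives in both results because cp437 does include it; only the capital letter is replaced.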
悍妇囚夫 2024-09-25 02:04:45

The UnicodeDammit class from BeautifulSoup can automagically detect the encoding. Detection is a guess, so check the result.

from BeautifulSoup import UnicodeDammit

u = UnicodeDammit("Ólafur Jóhann Ólafsson")

print u.unicode
print u.originalEncoding
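
If adding a dependency is not an option, a much cruder approach is to try a short list of candidate encodings in order. This is not how UnicodeDammit works; it is a separate, standard-library-only sketch in Python 3 syntax. The function name `decode_with_fallback` and the candidate list are hypothetical. A decode that succeeds only shows the bytes were valid in that encoding, not that the guess was right.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"


# Hypothetical inputs: the same name in two different encodings.
utf8_bytes = "Ólafur Jóhann Ólafsson".encode("utf-8")
legacy_bytes = "Ólafur Jóhann Ólafsson".encode("cp1252")

utf8_result = decode_with_fallback(utf8_bytes)
legacy_result = decode_with_fallback(legacy_bytes)
```

UTF-8 goes first because it is strict: bytes written in a legacy single-byte encoding rarely happen to be valid UTF-8, so a clean UTF-8 decode is a comparatively strong signal.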
哆兒滾 2024-09-25 02:04:45

This page may help you: http://wiki.python.org/moin/PrintFails

The problem, I guess, is that you need to print those names to the console. Do you really need that, or is it just a test environment? If you use the console only for testing, you could switch to other tools, such as unit tests, to check exactly what values you get.
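
One way to check values without depending on the console's encoding is sketched below in Python 3 syntax. The function `fetch_names` is a hypothetical stand-in for the web service client, and the UTF-8 bytes it decodes are made up for the example.

```python
import unittest


def fetch_names():
    # Hypothetical stand-in for the web service client; returns decoded text.
    return [b"\xc3\x93lafur J\xc3\xb3hann \xc3\x93lafsson".decode("utf-8")]


class NamesTest(unittest.TestCase):
    def test_name_is_decoded_text(self):
        names = fetch_names()
        # Compare against escapes so the test does not depend on how the
        # source file or the terminal is encoded.
        self.assertEqual(names[0], "\u00d3lafur J\u00f3hann \u00d3lafsson")


# ascii() renders non-ASCII characters as escapes, so the result can be
# printed safely on any terminal.
inspected = ascii(fetch_names()[0])
```

Running the module with `python -m unittest` would execute the test case; `inspected` can be printed for a quick manual check.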
