How do I print a list of strings when I cannot know the character encoding in advance?
I am retrieving a list of names from a webservice using a client I've written in Python. Upon retrieving the list, I encode each name to unicode and then print each of them to stdout. When I get to the name "Ólafur Jóhann Ólafsson", I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
Since I cannot know what the encoding will be, how do I convert all of these strings to unicode? Or can you suggest a better way to handle this problem?
Comments (3)
First of all, decode data to Unicode (text with no byte encoding attached) when reading from a file, pipe, socket, terminal, etc., and encode Unicode to an appropriate byte encoding when sending or persisting data. I suspect this is the root of your problem.
The web service should declare the encoding in the headers or data received.
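As a rough sketch of using a declared encoding: assuming the service sends an HTTP Content-Type header with a charset parameter, the standard library can extract it. The helper name charset_from_content_type, the sample header value, the sample body bytes, and the UTF-8 fallback are all assumptions for illustration, not something the service in question is known to send.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


# Hypothetical response: header value and body bytes are made up for the demo.
content_type = "text/plain; charset=ISO-8859-1"
body = b"\xd3lafur J\xf3hann \xd3lafsson"

encoding = charset_from_content_type(content_type)
name = body.decode(encoding)  # bytes -> text, using the declared encoding
```

If the header carries no charset, the helper returns the fallback, which is only a guess and may still be wrong for a given service.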
print normally encodes Unicode to the terminal's encoding automatically (discovered through sys.stdout.encoding) or, in the absence of that, just ascii. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError. Since that is not the error you received, you should post some code so we can see what you are doing. Most likely, you are encoding a byte string instead of decoding it. Here's an example of this:
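The snippet that originally followed did not survive on this page. The sketch below is a stand-in, not the author's code. It is written for Python 3, where bytes has no encode method, so the implicit ASCII decode that Python 2 performs before encoding is spelled out by hand. The helper name encode_like_python2 and the assumption that the service sent UTF-8 bytes are illustrative only.

```python
# Bytes as they might arrive from the web service (UTF-8 is an assumption).
raw = "Ólafur Jóhann Ólafsson".encode("utf-8")


def encode_like_python2(data, target):
    """Mimic Python 2's str.encode: implicit ASCII decode, then encode."""
    return data.decode("ascii").encode(target)


try:
    encode_like_python2(raw, "cp437")
    error_message = None
except UnicodeDecodeError as exc:
    # The decode step fails on the first non-ASCII byte, before any encoding.
    error_message = str(exc)

print(error_message)
```

The failure is raised by the hidden decode step, which is why asking to encode produces a UnicodeDecodeError.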
What I did here is call encode on a byte string. Since encode requires a Unicode string, Python used the default ascii encoding to decode the byte string to Unicode first, before encoding to cp437.
Fix this by decoding instead of encoding the data; print will then encode to stdout automatically. As long as your terminal supports the characters in the data, it will display properly:
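The original example is missing here as well. A minimal Python 3 sketch of the fix, assuming the bytes are UTF-8 (the 0xc3 byte in the error message is consistent with that, but it remains an assumption):

```python
import sys

# UTF-8 bytes for the name from the question (the encoding is an assumption).
raw = b"\xc3\x93lafur J\xc3\xb3hann \xc3\x93lafsson"

name = raw.decode("utf-8")  # bytes -> text; no encode call needed

# print encodes text for the terminal on its own, using this encoding.
print("stdout encoding:", sys.stdout.encoding)
print(name)
```

If the terminal's encoding cannot represent a character in the name, the last print raises UnicodeEncodeError rather than UnicodeDecodeError.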
The UnicodeDammit class from BeautifulSoup can automagically detect the encoding.
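Where adding a dependency is not an option, a much cruder standard-library stand-in is to try a short list of candidate encodings in order. This is not how UnicodeDammit works internally. The helper name decode_with_fallback, the candidate list, and the sample bytes are assumptions for illustration.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails,
    # though the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"


# The same name encoded two different ways (sample data for the demo).
utf8_bytes = b"\xc3\x93lafur J\xc3\xb3hann \xc3\x93lafsson"
latin1_bytes = b"\xd3lafur J\xf3hann \xd3lafsson"

text_a, enc_a = decode_with_fallback(utf8_bytes)
text_b, enc_b = decode_with_fallback(latin1_bytes)
```

Trying UTF-8 first is a deliberate choice: text in a single-byte encoding rarely happens to be valid UTF-8, while almost any byte sequence decodes without error under a single-byte codec, so the order of candidates matters.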
This page may help you: http://wiki.python.org/moin/PrintFails
The problem, I guess, is that you need to print those names to the console. Do you really need to, or is it just a test environment? If you use the console only for testing, you could switch to other tools, such as unit tests, to check exactly what values you get.
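One way to check values without depending on the console, sketched with Python 3's built-in ascii(), which escapes every non-ASCII character. The sample bytes and their UTF-8 encoding are assumptions.

```python
raw = b"\xc3\x93lafur J\xc3\xb3hann \xc3\x93lafsson"  # assumed UTF-8
name = raw.decode("utf-8")

# ascii() returns a pure-ASCII representation, so this print cannot raise
# UnicodeEncodeError regardless of the terminal's encoding.
escaped = ascii(name)
print(escaped)

# A test-style check on the exact code points instead of the console output.
assert name == "\u00d3lafur J\u00f3hann \u00d3lafsson"
```

The same comparison can be moved into a unit test so the decoded values are verified without ever touching stdout.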