The decode function tries to encode in Python

Posted 2024-10-14 10:31:35

I am trying to print a unicode string without the encoding-specific hex escapes in it. I'm grabbing this data from Facebook, whose HTML headers declare the encoding as UTF-8. When I print the type, it says it's unicode, but when I try to decode it with unicode-escape, I get an encoding error. Why is it trying to encode when I use the decode method?

Code

a='really long string of unicode html text that i wont reprint'
print type(a)
 >>> <type 'unicode'>   
print a.decode('unicode-escape')
 >>> Traceback (most recent call last):
  File "scfbp.py", line 203, in myFunctionPage
    print a.decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 1945: ordinal not in range(128)

Comments (3)

⒈起吃苦の倖褔 2024-10-21 10:31:35

It's not the decode that's failing. It's because you are trying to display the result on the console. When you use print, it encodes the string using the default encoding, which is ASCII. Don't use print and it should work.

>>> a=u'really long string containing \\u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
u'really long string containing \u20ac and some other text'
>>> print a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)

I'd recommend using IDLE or some other interpreter that can output unicode; then you won't get this problem.


Update: Note that this is not the same as the situation with one less backslash, where it fails during the decode itself, but with the same error message:

>>> a=u'really long string containing \u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)
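
The two cases above can be told apart by writing out the step that Python 2 performs implicitly. The following is a minimal sketch in Python 3 syntax, not the original poster's code; `py2_style_decode` is a hypothetical helper that only mimics what `unicode.decode('unicode-escape')` does in Python 2, namely an ASCII encode followed by the actual decode.

```python
def py2_style_decode(text):
    """Hypothetical helper mimicking Python 2's unicode.decode('unicode-escape')."""
    # Step 1 (implicit in Python 2): turn the text into bytes with the ascii codec.
    # This raises UnicodeEncodeError if the text holds any non-ASCII character.
    raw = text.encode("ascii")
    # Step 2: the decode that was actually requested.
    return raw.decode("unicode_escape")


escaped = "containing \\u20ac and more"  # backslash followed by 'u20ac': pure ASCII
literal = "containing \u20ac and more"   # a real euro sign

# The escaped form passes step 1, and step 2 turns the escape into a euro sign.
result = py2_style_decode(escaped)

# The literal form already fails in step 1, before any decoding happens.
try:
    py2_style_decode(literal)
    failed_in_encode = False
except UnicodeEncodeError:
    failed_in_encode = True
```

With the escaped input the error can only come later, when the decoded result is printed; with the literal input it comes from the hidden encode step.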
写下不归期 2024-10-21 10:31:35

When you print to the console, Python tries to encode (convert) the string to the character set of your terminal. If that character set is not UTF-8 or another one that can represent every character in the string, Python complains and raises an exception.

This snags me every now and then when I do quick processing of data, with for example Turkish characters in it.

If you are running python.exe through the Windows command prompt, you can find some solutions here: "What encoding/code page is cmd.exe using". Basically you can change the code page with chcp, but it's quite cumbersome. I would follow Mark's advice and use something like IDLE.
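
One way to avoid the exception without changing the terminal is to encode for the console yourself and choose what happens to unmappable characters. This is a minimal sketch in Python 3 syntax; `safe_for_console` is a hypothetical helper name, and falling back to ASCII when the stream reports no encoding is an assumption of the sketch.

```python
import sys


def safe_for_console(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes, so printing it will not raise."""
    # Assumption: fall back to ASCII if stdout does not report an encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    encoded = text.encode(encoding, errors="backslashreplace")
    return encoded.decode(encoding)


# Forcing ASCII shows the effect regardless of the real terminal.
shown = safe_for_console("cost: 5 \u20ac", encoding="ascii")
```

Here `shown` contains the six characters `\u20ac` in place of the euro sign. On a UTF-8 terminal, calling the helper without the `encoding` argument leaves the text unchanged.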

泪意 2024-10-21 10:31:35
>>> print type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')

Why is it trying to encode when I use the decode method?

Because you decode to Unicode, and you encode from it. You just tried to decode a unicode string to unicode. The first thing Python 2 then does is try to convert it to a byte string with the ascii codec. That's why you get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2110' in position 3: ordinal not in range(128)

Remember: Unicode is not an encoding. Everything else is, like ASCII, UTF-8, Latin-1, etc.

This implicit encoding is gone in Python 3, btw, because it confuses people.
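
A short sketch of what that looks like in Python 3, assuming the same kind of escaped input as in the question: text (`str`) has no `decode` method and `bytes` has no `encode` method, so both steps have to be spelled out. The variable names are illustrative only.

```python
text = "really long string containing \\u20ac"  # str is already Unicode text

# The methods that allowed the confusing round trip no longer exist.
str_has_decode = hasattr(text, "decode")
bytes_has_encode = hasattr(b"abc", "encode")

# Interpreting the escape sequence now takes two explicit steps:
# text -> bytes (encode), then bytes -> text (decode).
decoded = text.encode("ascii").decode("unicode_escape")
```

Calling `text.decode('unicode-escape')` in Python 3 raises an AttributeError, which points directly at the mistake.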
