Decode if it's not unicode

Posted on 2024-09-26 05:06:40

I want my function to take an argument that could be either a unicode object or a UTF-8 encoded string. Inside my function, I want to convert the argument to unicode. I have something like this:

def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')

    ...

Is it possible to avoid the use of isinstance? I was looking for something more duck-typing friendly.

During my experiments with decoding, I have run into several weird behaviours of Python. For instance:

>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)

Or

>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported

By the way, I'm using Python 2.6.


Comments (2)

机场等船 2024-10-03 05:06:40

You could just try decoding it with the 'utf-8' codec, and if that does not work, then return the object.

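def myfunction(text):
    try:
        # Byte strings are decoded as UTF-8.
        text = unicode(text, 'utf-8')
    except TypeError:
        # text is already a unicode object, so leave it unchanged.
        pass
    return text

print(myfunction(u'cer\xf3n'))
# cerón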

When you take a unicode object and call its decode method with the 'utf-8' codec, Python first tries to convert the unicode object to a string object, and then it calls the string object's decode('utf-8') method.

Sometimes the conversion from unicode object to string object fails because Python2 uses the ascii codec by default.
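To make the hidden step visible, here is a minimal sketch that performs it explicitly, assuming the implicit conversion uses the ascii codec as described above. The helper name `implicit_step_fails` is made up for illustration. It only calls `encode('ascii')`, so it should also run on Python 3:

```python
def implicit_step_fails(text):
    """Return True if encoding `text` with the ascii codec raises UnicodeEncodeError."""
    try:
        # This mirrors the conversion Python 2 attempts before decoding.
        text.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False


print(implicit_step_fails(u'hello'))      # pure ASCII, the conversion succeeds
print(implicit_step_fails(u'cer\xf3n'))   # u'\xf3' is outside the ASCII range
```

The first call prints False and the second prints True, which matches the traceback in the question: the failure is an encode error, even though the code asked for a decode.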

So, in general, never try to decode unicode objects. Or, if you must try, trap it in a try..except block. There may be a few codecs for which decoding unicode objects works in Python2 (see below), but they have been removed in Python3.
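If you do want to trap it, one possible duck-typed sketch looks like the following. The name `to_text` and its `encoding` parameter are hypothetical, not part of any library. It assumes that `decode` raises `UnicodeEncodeError` for non-ASCII unicode objects on Python 2, and that text strings have no `decode` method at all on Python 3 (hence the `AttributeError`):

```python
def to_text(value, encoding='utf-8'):
    """Return `value` decoded to text if it is a byte string, else unchanged."""
    try:
        return value.decode(encoding)
    except (AttributeError, UnicodeEncodeError):
        # AttributeError: the object has no decode method (Python 3 text).
        # UnicodeEncodeError: a Python 2 unicode object with non-ASCII characters.
        return value


print(to_text(b'cer\xc3\xb3n') == u'cer\xf3n')
print(to_text(u'cer\xf3n') == u'cer\xf3n')
```

Note that invalid UTF-8 input still raises `UnicodeDecodeError`, which is probably what you want, and that any object without a `decode` method is passed through untouched, which may or may not be.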

See this Python bug ticket for an interesting discussion of the issue, and also Guido van Rossum's blog:

"We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."
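As a rough illustration of that last point, and assuming Python 2.6+ or Python 3, conversions such as base64 and bz2 remain available through their own standard library modules rather than through the string `decode` method. The variable names below are only for illustration:

```python
import base64
import bz2

raw = b'hello'

# base64: bytes in, bytes out, via the base64 module.
encoded = base64.b64encode(raw)
print(encoded == b'aGVsbG8=')
print(base64.b64decode(encoded) == raw)

# bz2: compression is also a bytes-to-bytes transform.
compressed = bz2.compress(raw)
print(bz2.decompress(compressed) == raw)
```

All three comparisons should print True.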

迷离° 2024-10-03 05:06:40

I'm not aware of any good way to avoid the isinstance check in your function, but maybe someone else will be. I can point out that the two weirdnesses you cite happen because you're doing something that doesn't make sense: trying to decode into Unicode something that has already been decoded into Unicode.

The first should instead look like this, which decodes the UTF-8 encoding of that string into the Unicode version:

>>> 'cer\xc3\xb3n'.decode('utf-8')
u'cer\xf3n'

And your second should look like this (not using a u'' Unicode string literal):

>>> unicode('hello', 'utf-8')
u'hello'
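To tie the two corrected examples together, here is a small round trip, written so that it should run on both Python 2.6+ and Python 3 (the variable names are only for illustration). Encoding goes from unicode to bytes, and decoding goes from bytes back to unicode:

```python
text = u'cer\xf3n'

# unicode -> UTF-8 bytes
encoded = text.encode('utf-8')
print(encoded == b'cer\xc3\xb3n')

# UTF-8 bytes -> unicode
decoded = encoded.decode('utf-8')
print(decoded == text)
```

Both comparisons should print True. Decoding only ever makes sense on the `encoded` value, never on `text` itself.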