Decode if it's not unicode
I want my function to take an argument that could be a unicode object or a UTF-8 encoded string. Inside my function, I want to convert the argument to unicode. I have something like this:
def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')
    ...
Is it possible to avoid the use of isinstance? I was looking for something more duck-typing friendly.
During my experiments with decoding, I have run into several weird behaviours of Python. For instance:
>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)
Or
>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported
By the way, I'm using Python 2.6.
Comments (2)
You could just try decoding it with the 'utf-8' codec, and if that does not work, then return the object.
When you take a unicode object and call its decode method with the 'utf-8' codec, Python first tries to convert the unicode object to a string object, and then it calls the string object's decode('utf-8') method. Sometimes the conversion from unicode object to string object fails because Python2 uses the ascii codec by default.
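To make that implicit step visible, here is a small sketch that performs the ascii encoding explicitly. The variable names are only for illustration, and the snippet is written so it should behave the same on Python 2.6+ and Python 3:

```python
text = u'cer\xf3n'  # a unicode object with one non-ASCII character

# Roughly what Python 2 does implicitly before decoding a unicode object:
# it first encodes the text to a byte string with the default 'ascii' codec.
try:
    text.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)  # "'ascii' codec can't encode character ..."

# Pure-ASCII text survives the round trip, which is why u'hello' appears to work.
round_trip = u'hello'.encode('ascii').decode('utf-8')
```

The error comes from the hidden encode step, not from the utf-8 decode itself, which is why the traceback reports a UnicodeEncodeError.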
So, in general, never try to decode unicode objects. Or, if you must try, trap it in a try..except block. There may be a few codecs for which decoding unicode objects works in Python2 (see below), but they have been removed in Python3.
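If you do go the try..except route, it could look something like the sketch below. The name to_unicode and its encoding parameter are made up for this example, and catching AttributeError is my own addition so that the same code also runs on Python 3, where text strings have no decode method:

```python
def to_unicode(text, encoding='utf-8'):
    """Return text as unicode, decoding it only if decoding is possible."""
    try:
        return text.decode(encoding)
    except (AttributeError, UnicodeEncodeError):
        # AttributeError: the object has no decode() (Python 3 str).
        # UnicodeEncodeError: a Python 2 unicode object with non-ASCII
        # characters failed the implicit ascii encoding step.
        return text


from_bytes = to_unicode(b'cer\xc3\xb3n')  # UTF-8 bytes get decoded
from_text = to_unicode(u'cer\xf3n')       # already unicode, returned as-is
```

Note that a UnicodeDecodeError (byte strings that are not valid UTF-8) is deliberately not caught here, so bad input still fails loudly.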
See this Python bug ticket for an interesting discussion of the issue,
and also Guido van Rossum's blog:
I'm not aware of any good way to avoid the isinstance check in your function, but maybe someone else will be. I can point out that the two weirdnesses you cite are because you're doing something that doesn't make sense: trying to decode into Unicode something that's already decoded into Unicode. The first should instead look like this, which decodes the UTF-8 encoding of that string into the Unicode version:
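A minimal sketch of such a call, assuming the UTF-8 bytes of the string are written out as a byte-string literal (the b prefix is accepted from Python 2.6 on; the variable names are just for illustration):

```python
# 'ó' (U+00F3) is encoded in UTF-8 as the two bytes 0xC3 0xB3.
utf8_bytes = b'cer\xc3\xb3n'

# Decoding a byte string is the operation that makes sense here.
decoded = utf8_bytes.decode('utf-8')
```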
And your second should look like this (not using a u'' Unicode string literal):
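Something along these lines, as a sketch. The try/except fallback at the top is only there so the snippet also runs on Python 3, where the unicode builtin no longer exists and str plays that role; on Python 2.6 it is a no-op:

```python
try:
    unicode
except NameError:
    unicode = str  # Python 3 only: str is the Unicode text type

# Pass a byte string, not a u'' literal, when an encoding is given.
result = unicode(b'hello', 'utf-8')
```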