Python Unicode future-proof

(unicode error) 'unicodeescape' codec can't decode bytes - string with '\u'

Posted on 2024-12-07 19:51:34

Writing my code for Python 2.6, but with Python 3 in mind, I thought it was a good idea to put

from __future__ import unicode_literals

at the top of some modules. In other words, I am asking for trouble now (to avoid it in the future), but I might be missing some important knowledge here. I want to be able to pass a string representing a file path and instantiate an object as simply as

MyObject('H:\unittests')

In Python 2.6, this works just fine, with no need for double backslashes or a raw string, even for a directory starting with '\u..', which is exactly what I want. In the __init__ method I make sure all single \ occurrences are interpreted as '\\', including those before special characters as in \a, \b, \f, \n, \r, \t and \v (only \x remains a problem). Decoding the given string into unicode using the (local) encoding also works as expected.

Preparing for Python 3.x, simulating my actual problem in an editor (starting with a clean console in Python 2.6), the following happens:

>>> '\u'
'\\u'
>>> r'\u'
'\\u'

(OK until here: '\u' is encoded by the console using the local encoding)

>>> from __future__ import unicode_literals
>>> '\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence

In other words, the (unicode) string is not interpreted as unicode at all, nor does it get decoded automatically with the local encoding. The same happens for a raw string:

>>> r'\u'
SyntaxError: (unicode error) 'rawunicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX

The same goes for u'\u':

>>> u'\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence

Also, I would expect isinstance(str(''), unicode) to return True (which it does not), because importing unicode_literals should make all string types unicode. (edit:) Because in Python 3 all strings are sequences of Unicode characters, I would expect str('') to return such a unicode string, and type(str('')) to be both <type 'unicode'> and <type 'str'> (because all strings are unicode), but I also realise that <type 'unicode'> is not <type 'str'>. Confusion all around...

Questions

How can I best pass strings containing '\u' (without writing '\\u')?
Does from __future__ import unicode_literals really implement all Python 3 related unicode changes, so that I get a complete Python 3 string environment?

edit:
In Python 3, <type 'str'> is a Unicode object and <type 'unicode'> simply does not exist. In my case I want to write code for Python 2(.6) that will work in Python 3. But when I import unicode_literals, I cannot check if a string is of <type 'unicode'> because:

I assume unicode is not part of the namespace
if unicode is part of the namespace, a literal of <type 'str'> is still unicode when it is created in the same module
type(mystring) will always return <type 'str'> for unicode literals in Python 3

My modules are usually encoded in 'utf-8' via a # coding: UTF-8 comment at the top, while my locale.getdefaultlocale()[1] returns 'cp1252'. So if I call MyObject('çça') from my console, it is encoded as 'cp1252' in Python 2, and as 'utf-8' when MyObject('çça') is called from the module. In Python 3, it will not be encoded at all; it is a unicode literal.
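
As an illustration of that difference, here is a minimal sketch (written for Python 3, with the text spelled using escapes so it does not depend on the source file encoding) of how the same three characters become different bytes under the two encodings mentioned above. The variable names are only illustrative.

```python
# 'çça' written with escapes so the source file encoding does not matter.
text = '\u00e7\u00e7a'

as_cp1252 = text.encode('cp1252')  # one byte per character for this text
as_utf8 = text.encode('utf-8')     # each 'ç' takes two bytes in UTF-8

print(as_cp1252)  # b'\xe7\xe7a'
print(as_utf8)    # b'\xc3\xa7\xc3\xa7a'

# Decoding with the wrong codec does not raise here; it silently
# produces different (garbled) text instead of the original.
print(as_utf8.decode('cp1252') == text)  # False
```

This is why a byte string built in a UTF-8 module and one typed at a cp1252 console cannot be compared safely in Python 2 without decoding both first.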

edit:

I gave up hope of being allowed to avoid an extra '\' before a u (or an x, for that matter). I also understand the limitations of importing unicode_literals. However, the many possible combinations of passing a string from a module to the console and vice versa, each with a different encoding, and on top of that importing unicode_literals or not and Python 2 vs Python 3, made me want to create an overview by actual testing. Hence the table below.

[image: table of test results]

In other words, type(str('')) does not return <type 'str'> in Python 3 but <class 'str'>, and all of the Python 2 problems seem to be avoided.

睫毛上残留的泪 2024-12-14 19:51:34

AFAIK, all that from __future__ import unicode_literals does is make all string literals of unicode type instead of str type. That is:

>>> type('')
<type 'str'>
>>> from __future__ import unicode_literals
>>> type('')
<type 'unicode'>

But str and unicode are still different types, and they behave just like before.

>>> type(str(''))
<type 'str'>

That is always of str type.
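
For code that has to run on both Python 2 and Python 3, one common workaround is to pick the text type once, based on the interpreter version, and test against that. This is only a minimal sketch; text_type and is_text are hypothetical names, not something unicode_literals provides.

```python
import sys

if sys.version_info[0] >= 3:
    text_type = str        # Python 3: str is the Unicode text type
else:
    text_type = unicode    # noqa: F821 (only evaluated on Python 2)


def is_text(value):
    """Return True if value is Unicode text on the running interpreter."""
    return isinstance(value, text_type)


print(is_text(u'abc'))  # True on both Python 2 and Python 3
print(is_text(b'abc'))  # False on both: bytes are not Unicode text
```

The check goes through a name chosen at import time, so the module never mentions unicode on Python 3, where that name does not exist.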

About your r'\u' issue, it is by design, as it is equivalent to ru'\u' without unicode_literals. From the docs:

When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed while all other backslashes are left in the string.

This probably stems from the way the lexical analyzer worked in the Python 2 series. In Python 3 it works as you (and I) would expect.
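
One way to check this on whatever interpreter you have is to compile the literals from source text and see which ones are rejected. The helper name literal_compiles is made up for this sketch, and the results in the comments are what I would expect on Python 3.

```python
def literal_compiles(source):
    """Return True if source parses as an expression, False on SyntaxError."""
    try:
        compile(source, '<literal>', 'eval')
        return True
    except SyntaxError:
        return False


# The comment on each line shows the literal as the compiler sees it.
print(literal_compiles("'\\u'"))    # '\u'  -> False: truncated \uXXXX escape
print(literal_compiles("r'\\u'"))   # r'\u' -> True on Python 3
print(literal_compiles("'\\\\u'"))  # '\\u' -> True: escaped backslash
```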

You can type the backslash twice, and then the \u will not be interpreted, but you'll get two backslashes!

Backslashes can be escaped with a preceding backslash; however, both remain in the string

>>> ur'\\u'
u'\\\\u'

So IMHO, you have two simple options:

Do not use raw strings, and escape your backslashes (compatible with python3):

'H:\\unittests'

Be too smart and take advantage of unicode codepoints (not compatible with python3):

r'H:\u005cunittests'
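
To see why the second option does not carry over, here is a small sketch meant to be run on Python 3 only. On that version a raw string leaves \u005c untouched, so the two spellings produce different strings. The variable names are just for illustration.

```python
escaped = 'H:\\unittests'         # option 1: escaped backslash
raw = r'H:\unittests'             # plain raw string, accepted on Python 3
codepoint = r'H:\u005cunittests'  # option 2: \u005c is NOT processed on Python 3

print(escaped == raw)        # True
print(codepoint == escaped)  # False: the text "\u005c" stays in the string
print(len(escaped), len(codepoint))  # 12 17
```

If the path has to be spelled the same way on both versions, the escaped form is the one that behaves identically.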