Writing my code for Python 2.6, but with Python 3 in mind, I thought it was a good idea to put
from __future__ import unicode_literals
at the top of some modules. In other words, I am asking for trouble now (to avoid it in the future), but I might be missing some important knowledge here. I want to be able to pass a string representing a file path and instantiate an object as simply as
MyObject('H:\unittests')
In Python 2.6 this works just fine, with no need for double backslashes or a raw string, even for a directory starting with '\u..', which is exactly what I want. In the __init__ method I make sure all single \ occurrences are interpreted as '\\', including those before special characters as in \a, \b, \f, \n, \r, \t and \v (only \x remains a problem). Decoding the given string into unicode using the (local) encoding also works as expected.
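The __init__ clean-up described above could look roughly like the sketch below. The helper name restore_backslashes and its mapping are assumptions made for illustration, not the original code. It only undoes the seven single-character escapes listed, it would also rewrite a genuine tab or newline in a path, and it cannot recover \x or \u sequences, which the parser has already consumed or rejected.

```python
# Hypothetical helper (not from the original post): map the control characters
# produced by accidental escapes back to their two-character spellings.
_CONTROL_TO_ESCAPE = {
    '\a': '\\a',
    '\b': '\\b',
    '\f': '\\f',
    '\n': '\\n',
    '\r': '\\r',
    '\t': '\\t',
    '\v': '\\v',
}


def restore_backslashes(path):
    """Return path with accidental control characters turned back into backslash + letter."""
    return ''.join(_CONTROL_TO_ESCAPE.get(ch, ch) for ch in path)


# 'H:\temp\new' is parsed as 'H:' + TAB + 'emp' + NEWLINE + 'ew';
# the helper puts the backslashes back.
fixed = restore_backslashes('H:\temp\new')
```

Paths that were already escaped correctly pass through unchanged, because they contain no control characters.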
Preparing for Python 3.x, simulating my actual problem in an editor (starting with a clean console in Python 2.6), the following happens:
>>> '\u'
'\\u'
>>> r'\u'
'\\u'
(OK until here: '\u' is encoded by the console using the local encoding)
>>> from __future__ import unicode_literals
>>> '\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
In other words, the (unicode) string is not interpreted as unicode at all, nor does it get decoded automatically with the local encoding. The same happens for a raw string:
>>> r'\u'
SyntaxError: (unicode error) 'rawunicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX
The same goes for u'\u':
>>> u'\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
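The failures above can be reproduced outside a console by compiling the literals from source text. This is only a sketch: the helper name literal_compiles is made up here, and the results in the comments are for Python 3, where every literal is unicode but raw strings no longer process \u. On Python 2 the outcome depends on whether unicode_literals is in effect, as the session above shows.

```python
def literal_compiles(source):
    """Return True if the given source text is a valid expression, False on SyntaxError."""
    try:
        compile(source, '<demo>', 'eval')
        return True
    except SyntaxError:
        return False


# Each argument is the *source text* of a literal, quotes included.
plain_ok = literal_compiles(r"'\u'")      # False on Python 3: truncated \uXXXX escape
raw_ok = literal_compiles(r"r'\u'")       # True on Python 3: raw strings leave \u alone
escaped_ok = literal_compiles(r"'\\u'")   # True: the backslash is escaped explicitly
```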
Also, I would expect isinstance(str(''), unicode) to return True (which it does not), because importing unicode_literals should make all string types unicode. (edit:) Because in Python 3 all strings are sequences of Unicode characters, I would expect str('') to return such a unicode string, and type(str('')) to be both <type 'unicode'> and <type 'str'> (because all strings are unicode), but I also realise that <type 'unicode'> is not <type 'str'>. Confusion all around...
Questions
- How can I best pass strings containing '\u'? (without writing '\\u')
- Does from __future__ import unicode_literals really implement all Python 3 related unicode changes, so that I get a complete Python 3 string environment?
edit:
In Python 3, <type 'str'> is a Unicode object and <type 'unicode'> simply does not exist. In my case I want to write code for Python 2(.6) that will also work in Python 3. But when I import unicode_literals, I cannot check whether a string is of <type 'unicode'> because:
- I assume unicode is not part of the namespace
- if unicode is part of the namespace, a literal of <type 'str'> is still unicode when it is created in the same module
- type(mystring) will always return <type 'str'> for unicode literals in Python 3
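One common way around the type-check problem in the list above is to bind a single name to whichever text type the interpreter provides. The names text_type and is_text below are assumptions chosen for this sketch, not something from the original code.

```python
try:
    text_type = unicode  # Python 2: the separate unicode type exists
except NameError:
    text_type = str      # Python 3: str is already the unicode text type


def is_text(value):
    """Return True if value is a text (unicode) string on the running interpreter."""
    return isinstance(value, text_type)


# On Python 3 a plain literal is text and a bytes literal is not.
# On Python 2 a plain literal counts as text only with unicode_literals in effect.
literal_is_text = is_text('abc')
bytes_is_text = is_text(b'abc')
```

The check never mentions unicode outside the try block, so the same module imports cleanly on both versions.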
My modules are normally encoded in UTF-8, declared by a # coding: UTF-8 comment at the top, while my locale.getdefaultlocale()[1] returns 'cp1252'. So if I call MyObject('çça') from my console, the string is encoded as 'cp1252' in Python 2, and as 'utf-8' when MyObject('çça') is called from the module. In Python 3, it will not be encoded at all, but is a unicode literal.
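To make the difference concrete, here is a small sketch of what the two encodings do to the same text. The variable names are illustrative only, and the characters are written as escapes so the snippet does not depend on the encoding of the file it is saved in.

```python
text = u'\xe7\xe7a'  # the text 'çça' as a unicode string

# Bytes a Python 2 byte string would hold when typed at a cp1252 console.
console_bytes = text.encode('cp1252')

# Bytes a Python 2 byte string would hold in a module saved as UTF-8.
module_bytes = text.encode('utf-8')

# Decoding the module's bytes with the console's encoding gives mojibake,
# which is why the byte strings from the two sources are not interchangeable.
misread = module_bytes.decode('cp1252')
```

Decoding each byte string with the encoding it actually came from yields the same unicode text in both cases.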
edit:
I gave up hope of being allowed to avoid the extra '\' before a u (or an x, for that matter). I also understand the limitations of importing unicode_literals. However, the many possible combinations of passing a string from a module to the console and vice versa, each with a different encoding, and on top of that importing unicode_literals or not and Python 2 vs Python 3, made me want to create an overview by actual testing. Hence the table below.
In other words, type(str('')) does not return <type 'str'> in Python 3, but <class 'str'>, and all of the Python 2 problems seem to be avoided.
Comments (4)
AFAIK, all that from __future__ import unicode_literals does is make all string literals of unicode type instead of str type. That is, type('') becomes <type 'unicode'> once the import is in effect. But str and unicode are still different types, and they behave just like before: str('') is always of str type.
About your r'\u' issue, it is by design, as it is equivalent to ur'\u' without unicode_literals. According to the Python 2 docs, when the r prefix is combined with the u prefix, the \uXXXX and \UXXXXXXXX escapes are still processed while all other backslashes are left in the string. This probably comes from the way the lexical analyzer worked in the python2 series. In python3 it works as you (and I) would expect.
You can type the backslash twice, and then the \u will not be interpreted, but you'll get two backslashes!
So IMHO, you have two simple options:
Do not use raw strings, and escape your backslashes (compatible with python3):
'H:\\unittests'
Be too smart and take advantage of unicode codepoints (not compatible with python3):
r'H:\u005cunittests'
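As a rough check of why the second option is not portable, the sketch below compares the spellings on Python 3. The variable names are illustrative, and the comments describe Python 3 behaviour only.

```python
# Option 1: escaped backslash, same result on Python 2 and Python 3.
escaped = 'H:\\unittests'

# On Python 3 a raw string keeps the backslash and leaves \u alone.
raw = r'H:\unittests'

# Option 2: on Python 3 the raw string keeps the six characters \u005c
# literally instead of turning them into a single backslash.
codepoint = r'H:\u005cunittests'

same_path = (escaped == raw)
codepoint_matches = (codepoint == escaped)
```

On Python 3, same_path is true while codepoint_matches is false, so only the escaped form gives the intended path on both versions.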
For me this issue was related to a package version that was not up to date, in this case numpy. To fix:
I tried this on Python 3:
and it worked!
When you're writing string literals which contain backslashes, such as paths (on Windows) or regexes, use raw strings. That's what they're for.
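A minimal sketch of that advice, assuming Python 3 (on Python 2 with unicode_literals in effect, a raw literal containing \u is still rejected, as discussed above). The helper name has_drive_prefix and the pattern are made up for this example.

```python
import re

# Raw string for a Windows path: the backslash is kept as typed.
path = r'H:\unittests'

# Raw string for a regex: \\ in the pattern matches one literal backslash.
DRIVE_PREFIX = re.compile(r'^[A-Za-z]:\\')


def has_drive_prefix(value):
    """Return True if value starts with a drive letter, a colon and a backslash."""
    return DRIVE_PREFIX.match(value) is not None


path_has_drive = has_drive_prefix(path)
relative_has_drive = has_drive_prefix('unittests')
```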