Preparing for the transition from Python 2.x to 3.x
As we all know by now (I hope), Python 3 is slowly beginning to replace Python 2.x. Of course it will be many MANY years before most of the existing code is finally ported, but there are things we can do right now in our version 2.x code to make the switch easier.
Obviously taking a look at what's new in 3.x will be helpful, but what are some things we can do right now to make the upcoming conversion more painless (as well as make it easier to output updates to concurrent versions if needed)? I'm specifically thinking about lines we can start our scripts off with that will make earlier versions of Python more similar to 3.x, though other habits are also welcome.
The most obvious code to add to the top of the script that I can think of is:
from __future__ import division
from __future__ import print_function
try:
    range = xrange
except NameError:
    pass
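A quick sketch of what the two `__future__` imports change (they are no-ops in Python 3, so the same file runs identically under both versions):

```python
from __future__ import division, print_function

# True division: / returns a float even for two ints, as in Python 3.
print(3 / 2)   # 1.5, rather than Python 2's truncated 1
print(3 // 2)  # 1 -- floor division keeps its own operator

# print is now a function, so Python 3 keyword arguments work.
print('Hello', 'World', sep=', ', end='!\n')
```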
The most obvious habit I can think of is "{0} {1}!".format("Hello", "World") for string formatting.
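For example (the explicit positional indices are required on Python 2.6; 2.7 and 3.1+ also accept bare `{}` fields):

```python
# Old-style % interpolation still works, but str.format is the 3.x-friendly habit.
old = '%s %s!' % ('Hello', 'World')
new = '{0} {1}!'.format('Hello', 'World')
assert old == new == 'Hello World!'

# Named fields often read better than positional ones.
print('{greeting}, {name}!'.format(greeting='Hello', name='World'))
```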
Any other lines and good habits to get into?
The biggest problem that cannot be adequately addressed by micro-level changes and 2to3 is the change of the default string type from bytes to Unicode.
If your code needs to do anything with encodings and byte I/O, it's going to need a bunch of manual effort to convert correctly, so that things that have to be bytes remain bytes, and are decoded appropriately at the right stage. You'll find that some string methods (in particular format()) and library calls require Unicode strings, so you may need extra decode/encode cycles just to use the strings as Unicode even if they're really just bytes.

This is not helped by the fact that some of the Python standard library modules have been crudely converted using 2to3 without proper attention to bytes/unicode/encoding issues, and so themselves make mistakes about what string type is appropriate. Some of this is being thrashed out, but at least from Python 3.0 to 3.2 you will face confusing and potentially buggy behaviour from packages like urllib, email and wsgiref that need to know about byte encodings.
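The usual discipline, sketched here with a made-up UTF-8 payload, is to decode bytes to Unicode once at the input boundary, do all the real string work on Unicode, and encode again only on the way out:

```python
raw = b'caf\xc3\xa9 menu'        # bytes as they might arrive from a file or socket
text = raw.decode('utf-8')       # decode once, at the boundary
shouted = text.upper()           # all string work happens on the Unicode value
out = shouted.encode('utf-8')    # encode once, on the way back out

print(shouted)
```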
You can ameliorate the problem by being careful every time you write a string literal. Use u'' strings for anything that's inherently character-based, b'' strings for anything that's really bytes, and '' for the ‘default string’ type where it doesn't matter or you need to match a library call's string use requirements.

Unfortunately the b'' syntax was only introduced in Python 2.6, so doing this cuts off users of earlier versions.

eta:
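A sketch of the three literal forms side by side (note that Python 3.0–3.2 rejected the u'' prefix; it was restored in 3.3 to ease porting):

```python
char_data = u'r\u00e9sum\u00e9'  # Unicode in both Python 2 and 3
byte_data = b'\x89PNG\r\n'       # bytes in both (2.6+)
plain = 'default'                # str: bytes in Python 2, Unicode in Python 3

print(type(char_data), type(byte_data), type(plain))
```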
Oh my. Well...
A byte contains a value in the range 0–255, and may represent a load of binary data (eg. the contents of an image) or some text, in which case there has to be a standard chosen for how to map a set of characters into those bytes. Most of these ‘encoding’ standards map the normal ‘ASCII’ character set into the bytes 0–127 in the same way, so it's generally safe to use byte strings for ASCII-only text processing in Python 2.
If you want to use any of the characters outside the ASCII set in a byte string, you're in trouble, because each encoding maps a different set of characters into the remaining byte values 128–255, and most encodings can't map every possible character to bytes. This is the source of all those problems where you load a file from one locale into a Windows app in another locale and all the accented or non-Latin letters change to the wrong ones, making an unreadable mess. (aka ‘mojibake’.)
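You can reproduce mojibake deliberately by decoding bytes with the wrong codec; here Latin-1 bytes get misread as Cyrillic cp1251 (the codec pair is just an illustrative choice):

```python
text = u'caf\u00e9'
stored = text.encode('latin-1')    # the é becomes the single byte 0xE9
garbled = stored.decode('cp1251')  # but 0xE9 means 'й' in cp1251

print(garbled)                     # caf followed by the wrong letter
assert garbled != text
```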
There are also ‘multibyte’ encodings, which try to fit more characters into the available space by using more than one byte to store each character. These were introduced for East Asian locales, as there are so very many Chinese characters. But there's also UTF-8, a better-designed modern multibyte encoding which can accommodate every character.
If you are working on byte strings in a multibyte encoding—and today you probably will be, because UTF-8 is very widely used; really, no other encoding should be used in a modern application—then you've got even more problems than just keeping track of what encoding you're playing with.
len() is going to be telling you the length in bytes, not the length in characters, and if you start indexing and altering the bytes you're very likely to break a multibyte sequence in two, generating an invalid sequence and generally confusing everything.

For this reason, Python 1.6 and later have native Unicode strings (spelled u'something'), where each unit in the string is a character, not a byte. You can len() them, slice them, replace them, regex them, and they'll always behave appropriately. For text processing tasks they are indubitably better, which is why Python 3 makes them the default string type (without having to put a u before the '').

The catch is that a lot of existing interfaces, such as filenames on OSes other than Windows, or HTTP, or SMTP, are primarily byte-based, with a separate way of specifying the encoding. So when you are dealing with components that need bytes you have to take care to encode your unicode strings to bytes correctly, and in Python 3 you will have to do it explicitly in some places where before you didn't need to.
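Both points fit in a few lines (Python 3 syntax, assuming UTF-8; io.BytesIO stands in for any byte-based interface such as a socket or binary file):

```python
import io

s = u'na\u00efve'         # 5 characters: n a ï v e
b = s.encode('utf-8')     # the ï takes two bytes in UTF-8

print(len(s))             # 5 -- character count on the Unicode string
print(len(b))             # 6 -- byte count on the encoded form

# Byte-based interfaces force an explicit encode in Python 3:
stream = io.BytesIO()
stream.write(s.encode('utf-8'))
assert stream.getvalue() == b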
It is an internal implementation detail that Unicode strings take ‘two bytes’ of storage per unit internally. You never get to see that storage; you shouldn't think of it in terms of bytes. The units you are working on are conceptually characters, regardless of how Python chooses to represent them in memory.
...aside:
This isn't quite true. On ‘narrow builds’ of Python like the Windows build, each unit of a Unicode string is not technically a character, but a UTF-16 ‘code unit’. For the characters in the Basic Multilingual Plane, from 0x0000–0xFFFF you won't notice any difference, but if you're using characters from outside this 16-bit range, those in the ‘astral planes’, you'll find they take two units instead of one, and, again, you risk splitting a character when you slice them.
This is pretty bad, and has happened because Windows (and others, such as Java) settled on UTF-16 as an in-memory storage mechanism before Unicode grew beyond the 65,000-character limit. However, use of these extended characters is still pretty rare, and anyone on Windows will be used to them breaking in many applications, so it's likely not critical for you.
On ‘wide builds’, Unicode strings are made of real character ‘code point’ units, so even the extended characters outside of the BMP can be handled consistently and easily. The price to pay for this is efficiency: each string unit takes up four bytes of storage in memory.
每当我真正想要整数除法(而不是浮点数)时,我都会尝试养成使用诸如 var1//var2 之类的东西的习惯。这并不是向 Python 3 迈出的一大步,但至少我不必回去检查我的所有部门:)
I'm trying to get in the habit of using things like
var1//var2
whenever I actually want integer division (and not a float). Not a big step towards Python 3, but at least I won't have to go back and check all of my division :)