Python 字符串连接惯用语。需要澄清。

发布于 2024-08-16 08:06:30 字数 835 浏览 1 评论 0原文

来自 http://jaynes.colorado.edu/PythonIdioms.html

“将字符串构建为列表并使用 ''.在最后加入。 join 是一个字符串 在分隔符上调用的方法,而不是 名单。从空处呼唤它 字符串连接没有的片段 分隔符,这是 Python 的一个怪癖 起初相当令人惊讶。这是 重要:用 + 构建字符串是 二次时间而不是线性时间!如果 你学一种习语,就学这个。

错误:对于字符串中的 s:结果 += s

右:结果 = ''.join(strings)"

我不确定为什么这是真的。如果我有一些字符串,我想加入它们,对我来说不是对我来说,直观上最好将它们放入列表中,然后调用 ''.join 将它们放入列表中不会产生一些开销吗?

Python 命令行:

>>> str1 = 'Not'
>>> str2 = 'Cool'
>>> str3 = ''.join([str1, ' ', str2]) #The more efficient way **A**
>>> print str3
Not Cool
>>> str3 = str1 + ' ' + str2 #The bad way **B**
>>> print str3
Not Cool

A 真的是线性时间吗 ? B 是二次时间吗?

From http://jaynes.colorado.edu/PythonIdioms.html

"Build strings as a list and use
''.join at the end. join is a string
method called on the separator, not
the list. Calling it from the empty
string concatenates the pieces with no
separator, which is a Python quirk and
rather surprising at first. This is
important: string building with + is
quadratic time instead of linear! If
you learn one idiom, learn this one.

Wrong: for s in strings: result += s

Right: result = ''.join(strings)"

I'm not sure why this is true. If I have some strings I want to join them, for me it isn't intuitively better to me to put them in a list then call ''.join. Doesn't putting them into a list create some overhead? To Clarify...

Python Command Line:

>>> str1 = 'Not'
>>> str2 = 'Cool'
>>> str3 = ''.join([str1, ' ', str2]) #The more efficient way **A**
>>> print str3
Not Cool
>>> str3 = str1 + ' ' + str2 #The bad way **B**
>>> print str3
Not Cool

Is A really linear time and B is quadratic time?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

烟燃烟灭 2024-08-23 08:06:30

是的。对于您选择的示例,重要性尚不清楚,因为您只有两个非常短的字符串,因此附加可能会更快。

但每次在 Python 中对字符串执行 a + b 时,都会导致新的分配,然后将 a 和 b 中的所有字节复制到新字符串中。如果您在具有大量字符串的循环中执行此操作,则必须一次又一次地复制这些字节,并且每次必须复制的数量都会变得更长。这给出了二次行为。

另一方面,创建字符串列表不会复制字符串的内容 - 它只是复制引用。这是令人难以置信的快,并且以线性时间运行。然后,join 方法仅进行一次内存分配,并将每个字符串仅复制到正确的位置一次。这也只需要线性时间。

所以,是的,如果您可能要处理大量字符串,请使用 ''.join 习惯用法。对于只有两个字符串来说,这并不重要。

如果您需要更有说服力,请尝试自己创建一个由 10M 个字符组成的字符串:

>>> chars = ['a'] * 10000000
>>> r = ''
>>> for c in chars: r += c
>>> print len(r)

比较:

>>> chars = ['a'] * 10000000
>>> r = ''.join(chars)
>>> print len(r)

第一种方法大约需要 10 秒。第二次耗时不到 1 秒。

Yes. For the examples you chose the importance isn't clear because you only have two very short strings so the append would probably be faster.

But every time you do a + b with strings in Python it causes a new allocation and then copies all the bytes from a and b into the new string. If you do this in a loop with lots of strings these bytes have to be copied again, and again, and again and each time the amount that has to be copied gets longer. This gives the quadratic behaviour.

On the other hand, creating a list of strings doesn't copy the contents of the strings - it just copies the references. This is incredibly fast, and runs in linear time. The join method then makes just one memory allocation and copies each string into the correct position only once. This also takes only linear time.

So yes, do use the ''.join idiom if you are potentially dealing with a large number of strings. For just two strings it doesn't matter.

If you need more convincing, try it for yourself creating a string from 10M characters:

>>> chars = ['a'] * 10000000
>>> r = ''
>>> for c in chars: r += c
>>> print len(r)

Compared with:

>>> chars = ['a'] * 10000000
>>> r = ''.join(chars)
>>> print len(r)

The first method takes about 10 seconds. The second takes under 1 second.

墨小沫ゞ 2024-08-23 08:06:30

重复串联是二次的,因为它是 Schlemiel the Painter 算法(请注意,某些实现会优化此算法)远离,使其不是二次的)。 join 避免了这种情况,因为它需要整个字符串列表,分配必要的空间并一次性完成连接。

Repeated concatenation is quadratic because it's Schlemiel the Painter's Algorithm (beware that some implementations will optimize this away so that it is not quadratic). join avoids this because it takes the entire list of strings, allocates the necessary space and does the concatenation in one pass.

半城柳色半声笛 2024-08-23 08:06:30

当您编写s1 + s2时,Python需要分配一个新的字符串对象,将s1的所有字符复制到其中,然后将s2的所有字符复制到其中代码>.这个简单的操作不承担二次时间成本:成本是O(len(s1) + len(s2))(加上一个用于分配的常量,但这不在big-O中计算; -)。

但是,请考虑您给出的引用中的代码:for s in strings: result += s

这里,每次添加一个新s所有之前的都必须首先复制到新分配的空间中以获取结果 (字符串是不可变的,因此必须进行新的分配和复制)。假设你有 N 个长度为 L 的字符串:第一次复制 L 个字符,第二次复制 2 * L,第三次复制 3 * L...总之,这使得它 L * N * (N+1) / 2 个字符被复制...所以,是的,它是 N

的二次算法。在其他一些情况下,对于足够小的 N 值,二次算法可能比线性算法更快(因为乘数和固定固定成本可能要小得多);但这里的情况并非如此,因为分配成本很高(直接和间接,因为内存碎片的可能性)。相比之下,将字符串累积到列表中的开销基本上可以忽略不计。

When you code s1 + s2, Python needs to allocate a new string object, copy all characters of s1 into it, then after that all characters of s2. This trivial operation does not bear quadratic time costs: the cost is O(len(s1) + len(s2)) (plus a constant for allocation, but that doesn't figure in big-O;-).

However, consider the code in the quote you're giving: for s in strings: result += s.

Here, every time a new s is added, all the previous ones have to be first copied into the newly allocated space for result (strings are immutable, so the new allocation and copy must take place). Suppose you have N strings of length L: you'll copy L characters the first time, then 2 * L the second time, then 3 * L the third time... in all, that makes it L * N * (N+1) / 2 characters getting copied... so, yep, it's quadratic in N.

In some other cases, a quadratic algorithm may be faster than a linear one for small-enough values of N (because the multipliers and constant fixed-costs may be much smaller); but that's not the case here because allocations are costly (both directly, and indirectly because of the likelihood of fragmenting memory). In comparison, the overheads of accumulating the strings into a list is essentially negligible.

云巢 2024-08-23 08:06:30

Joel 在回归基础中对此进行了介绍。

Joel writes about this in Back to Basics.

苏佲洛 2024-08-23 08:06:30

如果您指的是与其他人相同的事物,那就不太明显了。当您有许多个字符串时,例如长度为NM,这种优化非常重要。那么

A

x = ''.join(strings) # Takes M*N operations 

B

x = ''
for s in strings:
    x = x + s  # Takes N + 2*N + ... + M*N operations

除非通过实现进行优化,是的,A 与总长度呈线性 T = M*N 和 BT*T / N,如果 M >>> ,它总是更差并且大致是二次方。 N。

现在为什么join实际上非常直观:当您说“我有一些字符串”时,通常可以通过说您有一个返回字符串的迭代器来形式化。现在,这正是您传递给 "string".join() 的内容

It's not obvious if you're referring to the same thing as other people. This optimization is important when you have many strings, say M of length N. Then

A

x = ''.join(strings) # Takes M*N operations 

B

x = ''
for s in strings:
    x = x + s  # Takes N + 2*N + ... + M*N operations

Unless optimized away by the implementation, yes, A is linear in the total length T = M*N and B is T*T / N which is always worse and roughly quadratic if M >> N.

Now why it is actually quite intuitive to join: when you say "I have some strings" this typically can be formalized by saying that you have an iterator that returns strings. Now, this is exactly what you pass to "string".join()

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文