Why is join() faster than normal concatenation?

Posted on 2024-08-22 08:02:47

I've seen several examples from different languages that clearly show that joining the elements of a list (array) is many times faster than concatenating strings one at a time. Why?

What algorithm does each operation use internally, and why is one faster than the other?

Here is a Python example of what I mean:

# This is slow
x = 'a'
x += 'b'
...
x += 'z'

# This is fast
x = ['a', 'b', ... 'z']
x = ''.join(x)

Comments (7)

顾铮苏瑾 2024-08-29 08:02:47

The code in a join function knows upfront all the strings it’s being asked to concatenate and how large those strings are, and hence it can calculate the final string length before beginning the operation.

Hence it needs to allocate memory for the final string only once, and can then place each source string (and delimiter) at the correct position in that memory.

On the other hand, a single += operation on a string has no choice but to allocate enough memory for its own result, which is the concatenation of just two strings. Each subsequent += must do the same, allocating memory that will be discarded on the next +=. Each time, the ever-growing string is copied from one place in memory to another.
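
To make the "measure first, allocate once" idea concrete, here is a minimal sketch. It is not how any particular runtime implements join; `join_sketch` is a hypothetical name, and a `bytearray` stands in for the single preallocated result buffer.

```python
def join_sketch(separator, parts):
    """Hypothetical two-pass join: measure first, then copy each piece once."""
    if not parts:
        return ""
    sep = separator.encode("utf-8")
    encoded = [p.encode("utf-8") for p in parts]

    # Pass 1: compute the final size before allocating anything.
    total = sum(len(e) for e in encoded) + len(sep) * (len(encoded) - 1)

    # One allocation for the whole result.
    buffer = bytearray(total)

    # Pass 2: copy every piece (and separator) to its final position once.
    pos = 0
    for i, e in enumerate(encoded):
        if i:
            buffer[pos:pos + len(sep)] = sep
            pos += len(sep)
        buffer[pos:pos + len(e)] = e
        pos += len(e)
    return buffer.decode("utf-8")


joined = join_sketch(", ", ["a", "b", "c"])
```

The point of the sketch is only the shape of the work: one size calculation, one buffer, and each character written once.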

旧时光的容颜 2024-08-29 08:02:47

The reason is that strings in Python (and many other languages) are immutable objects - that is, once created, they can't be changed. Instead, concatenating a string actually makes a new string which consists of the contents of the two smaller strings being concatenated, and then replaces the old string with the new one.

Since creating a string takes a certain amount of time (need to allocate memory, copy the contents of the string to that memory, et cetera), making many strings takes longer than making a single string. Doing N concatenations requires creating N new strings in the process. join(), on the other hand, only has to create a single string (the final result) and thus works much faster.
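
A small sketch of what immutability means in practice, using only built-in behaviour. The variable names are just for illustration.

```python
s = "abc"

# Strings cannot be modified in place: item assignment raises TypeError.
mutation_failed = False
try:
    s[0] = "x"
except TypeError:
    mutation_failed = True

# Concatenation builds a separate string; the original value is untouched.
original = s
s = s + "d"

assert mutation_failed
assert original == "abc"
assert s == "abcd"
assert original is not s
```

Because `original` still holds "abc" after the concatenation, the result had to be written somewhere else, which is where the extra allocation and copy come from.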

平安喜乐 2024-08-29 08:02:47

This is because a larger and larger chunk of memory has to be allocated for the string concatenation:

x = 'a' # String of size 1 allocated
x += 'b' # String of size 2 allocated, x copied, and 'b' added. Old x discarded
x += 'c' # String of size 3 allocated, x copied, and 'c' added. Old x discarded
x += 'd' # String of size 4 allocated, x copied, and 'd' added. Old x discarded
x += 'e' # String of size 5 allocated, x copied, and 'e' added. Old x discarded

So what happens is you perform large allocations and copies, but then turn around and throw them away. Very wasteful.

x = ['a', 'b', ..., 'z'] # 26 small allocations
x = ''.join(x) # A single, large allocation
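
As a rough way to put numbers on this, the sketch below counts how many characters get written under the simple model described above, where every += builds a brand-new string. It is a model, not a measurement of any real interpreter, and both function names are made up for the example.

```python
import string


def chars_written_by_repeated_concat(parts):
    """Model: every += allocates a new string and writes its full length."""
    written = 0
    length = 0
    for p in parts:
        length += len(p)   # size of the string after this step
        written += length  # old contents copied, new piece appended
    return written


def chars_written_by_join(parts):
    """Model: one buffer, every character written exactly once."""
    return sum(len(p) for p in parts)


letters = list(string.ascii_lowercase)
concat_cost = chars_written_by_repeated_concat(letters)  # 1 + 2 + ... + 26
join_cost = chars_written_by_join(letters)               # 26
```

For the 26 letters this gives 351 characters written versus 26, and the gap grows quadratically with the number of pieces.
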
浮光之海 2024-08-29 08:02:47

See Python string join performance and one specific answer that describes it very well:

The advice is about concatenating a lot of strings.

To compute s = s1 + s2 + ... + sn,

  1. using +. A new string s1+s2 is created, then a new string s1+s2+s3 is created, and so on, so a lot of memory allocation and copy operations are involved. In fact, s1 and s2 are each copied n-1 times, s3 is copied n-2 times, and so on.

  2. using "".join([s1,s2,...,sn]). The concatenation is done in one pass, and each char in the strings is copied only once.
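
The per-string copy counts can be tallied with a short sketch. It assumes the naive left-to-right model in which every + copies both operands into a fresh string; `copy_counts_with_plus` is a hypothetical helper, not part of any library.

```python
def copy_counts_with_plus(n):
    """Count how often each of s1..sn is copied when evaluating
    s1 + s2 + ... + sn left to right, one new string per +."""
    counts = [0] * n
    # Step k builds s1 + ... + sk, which copies the first k pieces.
    for k in range(2, n + 1):
        for i in range(k):
            counts[i] += 1
    return counts


counts = copy_counts_with_plus(5)
```

In this count s2 comes out the same as s1, since both are copied by the very first +, while the last piece is copied only once. With join, every entry would be 1.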

A君 2024-08-29 08:02:47

The other responses have basically covered it, but if you want even more detail, Joel Spolsky has an article where he describes "Schlemiel the painter's algorithm", which is extremely relevant and nicely makes the case for why understanding this sort of low-level implementation detail is still very important even if you're working in a high-level language like Python.

过潦 2024-08-29 08:02:47

I don't know the internals of join, but in the first version you create a new string every time you use the += operator. Since strings are immutable, each time new memory is allocated and a copy is made.

Join (which is a string method), on the other hand, can get by with a single allocation, since it can calculate the size beforehand.

○闲身 2024-08-29 08:02:47

Well, this is heavily language dependent, but the general idea is that one big operation is faster than many small ones.

In your second example, the join knows all the elements that it has to join and thus can just allocate the necessary resources and put the characters in.

The concatenation in your first example has to reallocate resources at every single step (worst case).
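
Since the result depends so much on the language and runtime, it is worth measuring rather than assuming. Below is a minimal timing sketch using the standard `timeit` module. The function names, input size, and repeat count are arbitrary choices for illustration. Note that CPython can sometimes extend a string in place on +=, so the gap you see may be much smaller than the worst case suggests.

```python
import timeit


def build_with_plus(parts):
    # Repeated concatenation, one piece at a time.
    result = ""
    for p in parts:
        result += p
    return result


def build_with_join(parts):
    # Single join over the whole list.
    return "".join(parts)


parts = ["x"] * 10_000

# Both approaches must produce the same string.
assert build_with_plus(parts) == build_with_join(parts)

plus_time = timeit.timeit(lambda: build_with_plus(parts), number=20)
join_time = timeit.timeit(lambda: build_with_join(parts), number=20)

print(f"+=  : {plus_time:.4f}s")
print(f"join: {join_time:.4f}s")
```

The absolute numbers will differ between machines and interpreter versions, so only the relative difference on your own setup is meaningful.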
