Why is subtraction faster than addition in Python?

Posted 2024-08-04 00:53:50

I was optimising some Python code, and tried the following experiment:

import time

start = time.clock()
x = 0
for i in range(10000000):
    x += 1
end = time.clock()

print '+=',end-start

start = time.clock()
x = 0
for i in range(10000000):
    x -= -1
end = time.clock()

print '-=',end-start

The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried varying the order of the loops, number of executions etc, and it still seems to work.

Stranger still,

for i in range(10000000, 0, -1):

(ie running the loop backwards) is faster than

for i in range(10000000):

even when loop contents are identical.

What gives, and is there a more general programming lesson here?


彡翼 2024-08-11 00:53:50

I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:

('+=', 11.370000000000001)
('-=', 10.769999999999998)

First, some observations:

  • This is 5% for a trivial operation. That's significant.
  • The speed of the native addition and subtraction opcodes is irrelevant. It's in the noise floor, completely dwarfed by the bytecode evaluation. We're talking about one or two native instructions amid thousands.
  • The bytecode generates exactly the same number of instructions; the only difference is INPLACE_ADD vs. INPLACE_SUBTRACT and +1 vs -1.

Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx. INPLACE_ADD has a significant extra block of code, to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT, since you can't subtract strings. That means INPLACE_ADD contains more native code. Depending (heavily!) on how the code is being generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtraction. This could be causing extra L2 cache hits, which could cause a significant performance difference.
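One way to check the bytecode claim is to disassemble the two loop bodies with the standard dis module. A minimal sketch (the function names add_loop and sub_loop are just for illustration; on Python 3.11+ the in-place opcodes are folded into a generic BINARY_OP, while Python 2.x and early 3.x show INPLACE_ADD / INPLACE_SUBTRACT):

import dis

def add_loop():
    x = 0
    for i in range(10):
        x += 1          # compiles to INPLACE_ADD (or BINARY_OP += on newer Pythons)
    return x

def sub_loop():
    x = 0
    for i in range(10):
        x -= -1         # compiles to INPLACE_SUBTRACT (or BINARY_OP -=)
    return x

# The two disassemblies should differ only in that one opcode and the constant.
dis.dis(add_loop)
dis.dis(sub_loop)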

This is heavily dependent on the system you're on (different processors have different amounts of cache and cache architectures), the compiler in use, including the particular version and compilation options (different compilers will decide differently which bits of code are on the critical path, which determines how assembly code is lumped together), and so on.

Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.

从﹋此江山别 2024-08-11 00:53:50

$ python -m timeit -s "x=0" "x+=1"
10000000 loops, best of 3: 0.151 usec per loop
$ python -m timeit -s "x=0" "x-=-1"
10000000 loops, best of 3: 0.154 usec per loop

Looks like you have some measurement bias.
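
As a further hedged sketch in Python 3 syntax (rather than the shell one-liners above), timeit.repeat runs each measurement several times so you can keep the best run, which is least affected by background load:

import timeit

# Repeat each measurement five times and keep the fastest run; the minimum
# is least distorted by one-off interference from other processes.
best_add = min(timeit.repeat("x += 1", setup="x = 0", repeat=5, number=10000000))
best_sub = min(timeit.repeat("x -= -1", setup="x = 0", repeat=5, number=10000000))
print("+=", best_add)
print("-=", best_sub)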

凑诗 2024-08-11 00:53:50

I think the "general programming lesson" is that it is really hard to predict, solely by looking at the source code, which sequence of statements will be the fastest. Programmers at all levels frequently get caught up by this sort of "intuitive" optimisation. What you think you know may not necessarily be true.

There is simply no substitute for actually measuring your program performance. Kudos for doing so; answering why undoubtedly requires delving deep into the implementation of Python, in this case.

With byte-compiled languages such as Java, Python, and .NET, it is not even sufficient to measure performance on just one machine. Differences between VM versions, native code translation implementations, CPU-specific optimisations, and so on will make this sort of question ever more tricky to answer.

兰花执着 2024-08-11 00:53:50

"The second loop is reliably faster ..."

That's your explanation right there. Re-order your script so the subtraction test is timed first, then the addition, and suddenly addition becomes the faster operation again:

-= 3.05
+= 2.84

Obviously something happens to the second half of the script that makes it faster. My guess is that the first call to range() is slower because Python needs to allocate enough memory for such a long list, but it is able to re-use that memory for the second call to range():

import time
start = time.clock()
x = range(10000000)
end = time.clock()
del x
print 'first range()',end-start
start = time.clock()
x = range(10000000)
end = time.clock()
print 'second range()',end-start

A few runs of this script show that the extra time needed for the first range() accounts for nearly all of the time difference between '+=' and '-=' seen above:

first range() 0.4
second range() 0.23
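
If this explanation is right, hoisting the allocation out of both timed sections should shrink the gap. A minimal sketch of that variant in Python 3 syntax, where list(range(...)) stands in for Python 2's list-building range():

import time

# Build the list once, outside both timed sections, so neither measurement
# pays the allocation cost that the first range() call paid above.
numbers = list(range(10000000))

start = time.perf_counter()
x = 0
for i in numbers:
    x += 1
print('+=', time.perf_counter() - start)

start = time.perf_counter()
x = 0
for i in numbers:
    x -= -1
print('-=', time.perf_counter() - start)
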
相思故 2024-08-11 00:53:50

It's always a good idea when asking a question to say what platform and what version of Python you are using. Sometimes it doesn't matter. This is NOT one of those times:

  1. time.clock() is appropriate only on Windows. Throw away your own measuring code and use -m timeit as demonstrated in pixelbeat's answer.

  2. Python 2.X's range() builds a list. If you are using Python 2.x, replace range with xrange and see what happens (see the sketch after this list).

  3. Python 3.X's int is Python 2.X's long.
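
A minimal sketch of the xrange substitution from item 2 (Python 2 only, since xrange was removed in Python 3, where range is already lazy):

import time

# Same timing harness as the question, but xrange yields indices lazily
# instead of allocating a ten-million-element list before the loop starts.
start = time.clock()
x = 0
for i in xrange(10000000):
    x += 1
print '+=', time.clock() - start

start = time.clock()
x = 0
for i in xrange(10000000):
    x -= -1
print '-=', time.clock() - start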

眼眸印温柔 2024-08-11 00:53:50

Is there a more general programming lesson here?

The more general programming lesson here is that intuition is a poor guide when predicting run-time performance of computer code.

One can reason about algorithmic complexity, hypothesise about compiler optimisations, estimate cache performance and so on. However, since these things can interact in non-trivial ways, the only way to be sure about how fast a particular piece of code is going to be is to benchmark it in the target environment (as you have rightfully done.)

魔法少女 2024-08-11 00:53:50

With Python 2.5 the biggest problem here is using range, which will allocate a list that big to iterate over it. When using xrange, whichever is done second is a tiny bit faster for me. (Not sure if range has become a generator in Python 3.)

烟沫凡尘 2024-08-11 00:53:50

Your experiment is faulty. The way this experiment should be designed is to write 2 different programs - 1 for addition, 1 for subtraction. They should be exactly the same and run under the same conditions with the data being put to file. Then you need to average the runs (at least several thousand), but you'd need a statistician to tell you an appropriate number.

If you wanted to analyze different methods of addition, subtraction, and looping, again each of those should be a separate program.

Experimental error might arise from processor heat and other activity going on in the CPU, so I'd execute the runs in a variety of patterns...
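
A hedged sketch of that idea in Python 3 syntax (not the separate-programs design the answer calls for, just the interleaving and averaging):

import random
import timeit

# Interleave the two measurements in random order and average them, so slow
# drift (CPU warming up, background load) affects both statements about equally.
add_times, sub_times = [], []
schedule = [("x += 1", add_times), ("x -= -1", sub_times)] * 20
random.shuffle(schedule)
for stmt, bucket in schedule:
    bucket.append(timeit.timeit(stmt, setup="x = 0", number=1000000))

print("+= mean:", sum(add_times) / len(add_times))
print("-= mean:", sum(sub_times) / len(sub_times))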

傲性难收 2024-08-11 00:53:50

That would be remarkable, so I have thoroughly evaluated your code and also set up the experiment in the way I consider more correct (all declarations and function calls outside the loop). I ran both versions five times.

  • Running your code validated your claims:
    -= takes consistently less time; 3.6% on average
  • Running my code, though, contradicts the outcome of your experiment:
    += takes on average (not always) 0.5% less time.

To show all results I have put plots online.

So, I conclude that your experiment has a bias, and it is significant.

Finally here is my code:

import time

addtimes = [0.] * 100
subtracttimes = [0.] * 100

range100 = range(100)
range10000000 = range(10000000)  # built once, outside the timed loops

j = 0
i = 0
x = 0
start = 0.

# Time 100 runs of the += loop.
for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x += 1
    addtimes[j] = time.clock() - start

# Time 100 runs of the -= loop.
for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x -= -1
    subtracttimes[j] = time.clock() - start

print '+=', sum(addtimes)
print '-=', sum(subtracttimes)
江湖彼岸 2024-08-11 00:53:50

Running the loop backwards is faster because the computer has an easier time checking whether a number is equal to 0.
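
A hedged way to check this claim is to time both loop directions directly, for example with timeit (Python 3 syntax; empty loop bodies so only the iteration cost differs):

import timeit

# Compare an identical empty loop iterating forwards and backwards.
forward = timeit.timeit("for i in range(10000000): pass", number=10)
backward = timeit.timeit("for i in range(10000000, 0, -1): pass", number=10)
print("forward:", forward)
print("backward:", backward)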
