Why is subtraction faster than addition in Python?
I was optimising some Python code, and tried the following experiment:

import time

start = time.clock()
x = 0
for i in range(10000000):
    x += 1
end = time.clock()
print '+=', end - start

start = time.clock()
x = 0
for i in range(10000000):
    x -= -1
end = time.clock()
print '-=', end - start
The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried varying the order of the loops, number of executions etc, and it still seems to work.
Stranger still,
for i in range(10000000, 0, -1):
(ie running the loop backwards) is faster than
for i in range(10000000):
even when loop contents are identical.
What gives, and is there a more general programming lesson here?
Comments (10)
I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:

First, some observations: INPLACE_ADD vs. INPLACE_SUBTRACT, and +1 vs -1. Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx.

INPLACE_ADD has a significant extra block of code, to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT, since you can't subtract strings. That means INPLACE_ADD contains more native code. Depending (heavily!) on how the code is generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtractions. This could be causing extra L2 cache misses, which could cause a significant performance difference.

This is heavily dependent on the system you're on (different processors have different amounts of cache and different cache architectures), the compiler in use, including the particular version and compilation options (different compilers decide differently which bits of code are on the critical path, which determines how the assembly code is lumped together), and so on.

Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.
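The opcode difference this answer describes can be inspected directly with the standard dis module. A minimal sketch (note: on Python 3.10 and earlier the opcodes are INPLACE_ADD/INPLACE_SUBTRACT; Python 3.11+ folds both into a single BINARY_OP instruction):

```python
import dis

def inc(x):
    x += 1      # INPLACE_ADD on Python <= 3.10, BINARY_OP on 3.11+
    return x

def dec(x):
    x -= -1     # INPLACE_SUBTRACT on Python <= 3.10, BINARY_OP on 3.11+
    return x

# Print the bytecode of each function to compare the opcodes used.
dis.dis(inc)
dis.dis(dec)
```

On old interpreters this makes the ADD/SUBTRACT asymmetry visible; on modern ones it shows why the measured difference disappeared.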
Looks like you have some measurement bias.
I think the "general programming lesson" is that it is really hard to predict, solely by looking at the source code, which sequence of statements will be the fastest. Programmers at all levels frequently get caught up by this sort of "intuitive" optimisation. What you think you know may not necessarily be true.
There is simply no substitute for actually measuring your program performance. Kudos for doing so; answering why undoubtedly requires delving deep into the implementation of Python, in this case.
With byte-compiled languages such as Java, Python, and .NET, it is not even sufficient to measure performance on just one machine. Differences between VM versions, native code translation implementations, CPU-specific optimisations, and so on will make this sort of question ever more tricky to answer.
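As a sketch of such a measurement, the standard timeit module repeats a statement many times and is far less sensitive to one-off effects than hand-rolled timing:

```python
import timeit

# Time each statement a million times, repeat the whole run 5 times,
# and keep the minimum, which is least affected by system noise.
t_add = min(timeit.repeat('x += 1', setup='x = 0', number=1_000_000, repeat=5))
t_sub = min(timeit.repeat('x -= -1', setup='x = 0', number=1_000_000, repeat=5))
print('+=', t_add)
print('-=', t_sub)
```

Running this across interpreter versions and machines is the only way to see whether any difference is real or an artifact of one environment.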
That's your explanation right there. Re-order your script so the subtraction test is timed first, then the addition, and suddenly addition becomes the faster operation again:

Obviously something happens in the second half of the script that makes it faster. My guess is that the first call to range() is slower because Python needs to allocate enough memory for such a long list, but it is able to re-use that memory for the second call to range(). A few runs of this script show that the extra time needed for the first range() accounts for nearly all of the time difference between '+=' and '-=' seen above.
It's always a good idea when asking a question to say what platform and what version of Python you are using. Sometimes it doesn't matter. This is NOT one of those times:

time.clock() is appropriate only on Windows. Throw away your own measuring code and use -m timeit as demonstrated in Pixelbeat's answer.

Python 2.X's range() builds a list. If you are using Python 2.x, replace range with xrange and see what happens.

Python 3.X's int is Python 2.X's long.
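On Python 3 the range/xrange distinction mentioned above is gone: range is a lazy sequence that never materializes a list, which can be checked directly:

```python
import sys

r = range(10_000_000)
# The object size is small and constant regardless of length -
# no ten-million-element list is ever built.
print(sys.getsizeof(r))
# It still supports len(), indexing, and membership tests lazily.
print(len(r), r[12345], 42 in r)
```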
The more general programming lesson here is that intuition is a poor guide when predicting run-time performance of computer code.
One can reason about algorithmic complexity, hypothesise about compiler optimisations, estimate cache performance and so on. However, since these things can interact in non-trivial ways, the only way to be sure about how fast a particular piece of code is going to be is to benchmark it in the target environment (as you have rightfully done.)
With Python 2.5 the biggest problem here is using range, which will allocate a list that big to iterate over it. When using xrange, whichever is done second is a tiny bit faster for me. (Not sure if range has become a generator in Python 3.)
Your experiment is faulty. The way this experiment should be designed is to write 2 different programs - 1 for addition, 1 for subtraction. They should be exactly the same and run under the same conditions, with the data being written to a file. Then you need to average the runs (at least several thousand), though you'd need a statistician to tell you an appropriate number.

If you wanted to analyze different methods of addition, subtraction, and looping, again each of those should be a separate program.

Experimental error might arise from processor heat and other activity going on on the CPU, so I'd execute the runs in a variety of patterns...
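A sketch of that kind of protocol with the standard library: collect many timing samples per operation and compare means and spreads rather than single runs (the sample counts here are illustrative, not statistically justified):

```python
import statistics
import timeit

def sample(stmt, number=100_000, repeat=30):
    # One sample = `number` executions of stmt; collect `repeat` samples.
    return timeit.repeat(stmt, setup='x = 0', number=number, repeat=repeat)

for stmt in ('x += 1', 'x -= -1'):
    times = sample(stmt)
    print(stmt, 'mean:', statistics.mean(times), 'stdev:', statistics.stdev(times))
```

If the difference between the two means is small relative to the standard deviations, the "subtraction is faster" effect is indistinguishable from noise.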
That would be remarkable, so I have thoroughly evaluated your code and also set up the experiment as I would find it more correct (all declarations and function calls outside the loop). I ran both versions five times.

-= consistently took less time: 3.6% on average.
+= took on average (but not always) 0.5% less time.

To show all results I have put the plots online:

So, I conclude that your experiment has a bias, and it is significant.

Finally, here is my code:
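The code itself did not survive in this copy of the page; a minimal sketch matching the description (all declarations outside the timed loop, each version run five times, with the iteration count reduced here for speed) might look like:

```python
import time

N = 1_000_000   # per-run iteration count (reduced; raise for real measurements)
RUNS = 5

def bench_add():
    x = 0
    start = time.perf_counter()
    for _ in range(N):
        x += 1
    return time.perf_counter() - start

def bench_sub():
    x = 0
    start = time.perf_counter()
    for _ in range(N):
        x -= -1
    return time.perf_counter() - start

add_times = [bench_add() for _ in range(RUNS)]
sub_times = [bench_sub() for _ in range(RUNS)]
print('+= runs:', add_times)
print('-= runs:', sub_times)
```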
Running the loop backwards is faster because the computer has an easier time comparing whether a number is equal to 0.