Since this question is about the increment operator and speed differences with prefix/postfix notation, I will describe the question very carefully lest Eric Lippert discover it and flame me!
(further info and more detail on why I am asking can be found at http://www.codeproject.com/KB/cs/FastLessCSharpIteration.aspx?msg=3899456#xx3899456xx/)
I have four snippets of code as follows:-
(1) Separate, Prefix:
for (var j = 0; j != jmax;) { total += intArray[j]; ++j; }
(2) Separate, Postfix:
for (var j = 0; j != jmax;) { total += intArray[j]; j++; }
(3) Indexer, Postfix:
for (var j = 0; j != jmax;) { total += intArray[j++]; }
(4) Indexer, Prefix:
for (var j = -1; j != last;) { total += intArray[++j]; } // last = jmax - 1
What I was trying to do was prove/disprove whether there is a performance difference between prefix and postfix notation in this context (i.e. a local variable, so not volatile, not changeable from another thread, etc.) and, if there is, why that would be.
Speed testing showed that:

- (1) and (2) run at the same speed as each other.
- (3) and (4) run at the same speed as each other.
- (3)/(4) are ~27% slower than (1)/(2).
Therefore I am concluding that there is no performance advantage of choosing prefix notation over postfix notation per se. However, when the Result of the Operation is actually used, the resulting code is slower than when it is simply thrown away.
I then had a look at the generated IL using Reflector and found the following:
- The number of IL bytes is identical in all cases.
- The .maxstack varied between 4 and 6, but I believe that is used only for verification purposes and so is not relevant to performance.
- (1) and (2) generated exactly the same IL, so it's no surprise that the timing was identical. So we can ignore (1).
- (3) and (4) generated very similar code - the only relevant difference being the positioning of a dup opcode to account for the Result of the Operation. Again, no surprise about the timing being identical.
So I then compared (2) and (3) to find out what could account for the difference in speed. The relevant IL for incrementing j in (1) (and (2)) is:
// ldloc.0 already used once for the indexer operation higher up
ldloc.0
ldc.i4.1
add
stloc.0
(3) looks like this:
ldloc.0
dup // j on the stack for the *Result of the Operation*
ldc.i4.1
add
stloc.0
(4) looks like this:
ldloc.0
ldc.i4.1
add
dup // j + 1 on the stack for the *Result of the Operation*
stloc.0
Now (finally!) to the question:
Is (2) faster because the JIT compiler recognises the pattern ldloc.0/ldc.i4.1/add/stloc.0 as simply incrementing a local variable by 1 and optimizes it?
(and the presence of a dup in (3) and (4) breaks that pattern, so the optimization is missed)
And a supplementary:
If this is true then, for (3) at least, wouldn't replacing the dup with another ldloc.0 reintroduce that pattern?
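For concreteness, the rewrite of (3) I have in mind would be (hypothetical, hand-edited IL, not compiler output):

ldloc.0
ldloc.0 // j reloaded for the *Result of the Operation* instead of dup'd
ldc.i4.1
add
stloc.0

which contains the ldloc.0/ldc.i4.1/add/stloc.0 sequence verbatim.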
OK after much research (sad I know!), I think have answered my own question:
The answer is Maybe.
Apparently the JIT compilers do look for patterns (see http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx) to decide when and how array bounds checking can be optimized but whether it is the same pattern I was guessing at or not I don't know.
In this case, it is a moot point because the relative speed increase of (2) was due to something more than that. It turns out that the x64 JIT compiler is clever enough to work out whether an array length is constant (and seemingly also a multiple of the number of unrolls in a loop): so the code was only bounds checking at the end of each iteration, and each unroll became just:-
I proved this by changing the app to let the array size be specified on the command line and seeing the different assembler output.
Other things discovered during this exercise:-
Interesting results. What I would do is:
And then you'll know whether the jitter is doing a better job with one than the other. The jitter might, for example, be realizing that in one case it can remove array bounds checks, but not realizing that in the other case. I don't know; I'm not an expert on the jitter.
The reason for all the rigamarole is because the jitter may generate different code when the debugger is attached. If you want to know what it does under normal circumstances then you have to make sure the code gets jitted under normal, non-debugger circumstances.
I love performance testing and I love fast programs so I admire your question.
I tried to reproduce your findings and failed. On my Intel i7 x64 system running your code samples on .NET4 framework in the x86|Release configuration, all four test cases produced roughly the same timings.
To do the test I created a brand new console application project and used the QueryPerformanceCounter API call to get a high-resolution CPU-based timer. I tried two settings for jmax: jmax = 1000 and jmax = 1000000, because locality of the array can often make a big difference in how the performance behaves as the size of the loop increases. However, both array sizes behaved the same in my tests.
I have done a lot of performance optimization and one of the things that I have learned is that you can very easily optimize an application so that it runs faster on one particular computer while inadvertently causing it to run slower on another computer.
I am not talking hypothetically here. I have tweaked inner loops and poured hours and days of work to make a program run faster, only to have my hopes dashed because I was optimizing it on my workstation and the target computer was a different model of Intel processor.
So the moral of this story is:
This is why some compilers have special optimization switches for different processors or some applications come in different versions even though one version could easily run on all supported hardware.
So if you are going to do testing like this, you have to do it same way that JIT compiler writers do: you have to perform your tests on a wide variety of hardware and then choose a blend, a happy-medium that gives the best performance on the most ubiquitous hardware.