IL generated by C# for the ++ operator - when and why prefix/postfix notation is faster

Posted 2024-11-08 13:32:29

Since this question is about the increment operator and speed differences with prefix/postfix notation, I will describe the question very carefully lest Eric Lippert discover it and flame me!

(further info and more detail on why I am asking can be found at http://www.codeproject.com/KB/cs/FastLessCSharpIteration.aspx?msg=3899456#xx3899456xx/)

I have four snippets of code as follows:-

(1) Separate, Prefix:

    for (var j = 0; j != jmax;) { total += intArray[j]; ++j; }

(2) Separate, Postfix:

    for (var j = 0; j != jmax;) { total += intArray[j]; j++; }

(3) Indexer, Postfix:

    for (var j = 0; j != jmax;) { total += intArray[j++]; }

(4) Indexer, Prefix:

    for (var j = -1; j != last;) { total += intArray[++j]; } // last = jmax - 1
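A minimal harness for timing two of the snippets might look like this (a sketch only: the array size, the fill values, and the use of `Stopwatch` are my assumptions, not the original test rig):

```csharp
using System;
using System.Diagnostics;

class IncrementTiming
{
    static void Main()
    {
        const int jmax = 1000000;
        var intArray = new int[jmax];
        for (var i = 0; i < jmax; i++) intArray[i] = 1; // known contents so the totals are checkable

        // (2) Separate, Postfix
        long total = 0;
        var sw = Stopwatch.StartNew();
        for (var j = 0; j != jmax;) { total += intArray[j]; j++; }
        sw.Stop();
        Console.WriteLine($"(2) separate/postfix: total={total}, {sw.ElapsedMilliseconds} ms");

        // (3) Indexer, Postfix
        total = 0;
        sw = Stopwatch.StartNew();
        for (var j = 0; j != jmax;) { total += intArray[j++]; }
        sw.Stop();
        Console.WriteLine($"(3) indexer/postfix: total={total}, {sw.ElapsedMilliseconds} ms");
    }
}
```

Both loops should report a total of 1000000; only the timings differ.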

What I was trying to do was prove/disprove whether there is a performance difference between prefix and postfix notation in this context (ie a local variable so not volatile, not changeable from another thread etc.) and if there was, why that would be.

Speed testing showed that:

  • (1) and (2) run at the same speed as each other.

  • (3) and (4) run at the same speed as each other.

  • (3)/(4) are ~27% slower than (1)/(2).

Therefore I am concluding that there is no performance advantage in choosing prefix notation over postfix notation per se. However, when the Result of the Operation is actually used, this results in slower code than if it is simply thrown away.

I then had a look at the generated IL using Reflector and found the following:

  • The number of IL bytes is identical in all cases.

  • The .maxstack varied between 4 and 6 but I believe that is used only for verification purposes and so not relevant to performance.

  • (1) and (2) generated exactly the same IL, so it's no surprise that the timing was identical. So we can ignore (1).

  • (3) and (4) generated very similar code - the only relevant difference being the positioning of a dup opcode to account for the Result of the Operation. Again, no surprise about timing being identical.

So I then compared (2) and (3) to find out what could account for the difference in speed:

  • (2) uses a ldloc.0 op twice (once as part of the indexer and then later as part of the increment).

  • (3) uses ldloc.0 followed immediately by a dup op.

So the relevant IL for incrementing j in (1) (and (2)) is:

// ldloc.0 already used once for the indexer operation higher up
ldloc.0
ldc.i4.1
add
stloc.0

(3) looks like this:

ldloc.0
dup // j on the stack for the *Result of the Operation*
ldc.i4.1
add
stloc.0

(4) looks like this:

ldloc.0
ldc.i4.1
add
dup // j + 1 on the stack for the *Result of the Operation*
stloc.0

Now (finally!) to the question:

Is (2) faster because the JIT compiler recognises the pattern ldloc.0/ldc.i4.1/add/stloc.0 as simply incrementing a local variable by 1 and optimizes it?
(and does the presence of a dup in (3) and (4) break that pattern, so the optimization is missed?)

And a supplementary:
If this is true then, for (3) at least, wouldn't replacing the dup with another ldloc.0 reintroduce that pattern?
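Incidentally, the identical-IL observation for (1) and (2) is easy to re-verify: compile a pair of methods like the sketch below (the method and class names are mine) and compare them in ildasm or Reflector. Because the result of the increment is discarded in both, they produce the same IL.

```csharp
using System;

class PrefixPostfixIl
{
    // Same loop shape as snippets (1) and (2); only the increment notation differs.
    static int SumPrefix(int[] a)
    {
        var total = 0;
        for (var j = 0; j != a.Length;) { total += a[j]; ++j; }
        return total;
    }

    static int SumPostfix(int[] a)
    {
        var total = 0;
        for (var j = 0; j != a.Length;) { total += a[j]; j++; }
        return total;
    }

    static void Main()
    {
        var data = new[] { 1, 2, 3, 4, 5 };
        Console.WriteLine(SumPrefix(data));  // 15
        Console.WriteLine(SumPostfix(data)); // 15
    }
}
```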

Comments (3)

卸妝后依然美 2024-11-15 13:32:29


OK, after much research (sad, I know!), I think I have answered my own question:

The answer is Maybe.
Apparently the JIT compilers do look for patterns (see http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx) to decide when and how array bounds checking can be optimized but whether it is the same pattern I was guessing at or not I don't know.

In this case, it is a moot point because the relative speed increase of (2) was due to something more than that. It turns out that the x64 JIT compiler is clever enough to work out whether an array length is constant (and seemingly also a multiple of the number of unrolls in a loop): so the code was only bounds checking at the end of each iteration, and each unroll became just:-

        total += intArray[j]; j++;
00000081 8B 44 0B 10          mov         eax,dword ptr [rbx+rcx+10h] 
00000085 03 F0                add         esi,eax 

I proved this by changing the app to let the array size be specified on the command line and seeing the different assembler output.
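That verification could look something like this (a sketch; the class name, the fill values, and the use of `Stopwatch` are my assumptions): taking the array length from the command line means the JIT cannot treat it as a compile-time constant, so the constant-length optimization described above no longer applies and the generated assembler changes.

```csharp
using System;
using System.Diagnostics;

class VariableSizeTest
{
    static void Main(string[] args)
    {
        // Length supplied at runtime, so it is not a JIT-visible constant.
        int jmax = args.Length > 0 ? int.Parse(args[0]) : 1000000;
        var intArray = new int[jmax];
        for (var i = 0; i < jmax; i++) intArray[i] = 1;

        var sw = Stopwatch.StartNew();
        long total = 0;
        for (var j = 0; j != jmax;) { total += intArray[j]; j++; }
        sw.Stop();

        Console.WriteLine($"jmax={jmax}, total={total}, elapsed={sw.Elapsed.TotalMilliseconds} ms");
    }
}
```

Comparing the disassembly of this loop against the constant-length version shows whether the bounds-check/unroll optimization survived.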

Other things discovered during this exercise:-

  • For a standalone increment operation (ie the result is not used), there is no difference in speed between prefix/postfix.
  • When an increment operation is used in an indexer, the assembler shows that prefix notation is slightly more efficient (and so close in the original case that I assumed it was just a timing discrepancy and called them equal - my mistake). The difference is more pronounced when compiled as x86.
  • Loop unrolling does work. Compared to a standard loop with array bounds optimization, 4 unrolls always gave an improvement of 10%-20% (and the x64/constant case 34%). Increasing the number of unrolls gave varied timings, some very much slower in the case of a postfix in the indexer, so I'll stick with 4 if unrolling and only change that after extensive timing for a specific case.
森林很绿却致人迷途 2024-11-15 13:32:29


Interesting results. What I would do is:

  • Rewrite the application to do the whole test twice.
  • Put a message box between the two test runs.
  • Compile for release, no optimizations, and so on.
  • Start the executable outside of the debugger.
  • When the message box comes up, attach the debugger
  • Now inspect the code generated for the two different cases by the jitter.

And then you'll know whether the jitter is doing a better job with one than the other. The jitter might, for example, be realizing that in one case it can remove array bounds checks, but not realizing that in the other case. I don't know; I'm not an expert on the jitter.

The reason for all the rigamarole is because the jitter may generate different code when the debugger is attached. If you want to know what it does under normal circumstances then you have to make sure the code gets jitted under normal, non-debugger circumstances.
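The procedure above could be sketched like this (the class and method names are mine, and I have substituted a console pause for the message box, which serves the same purpose in a console app):

```csharp
using System;

class JitInspection
{
    // The loop under test; returns the total so the result is observable.
    static long RunTest()
    {
        var a = new int[100000];
        long total = 0;
        for (var j = 0; j != a.Length;) { total += a[j++]; }
        return total;
    }

    static void Main()
    {
        RunTest(); // jitted here, under normal non-debugger conditions

        // Pause so a debugger can be attached *after* the jitter has already
        // generated its optimized code for RunTest.
        Console.WriteLine("Attach the debugger now, then press Enter...");
        if (!Console.IsInputRedirected) Console.ReadLine();

        // Step into this call under the debugger and read the Disassembly window.
        Console.WriteLine(RunTest());
    }
}
```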

一人独醉 2024-11-15 13:32:29


I love performance testing and I love fast programs so I admire your question.

I tried to reproduce your findings and failed. On my Intel i7 x64 system running your code samples on .NET4 framework in the x86|Release configuration, all four test cases produced roughly the same timings.

To do the test I created a brand new console application project and used the QueryPerformanceCounter API call to get a high-resolution CPU-based timer. I tried two settings for jmax:

  • jmax = 1000
  • jmax = 1000000

because locality of the array can often make a big difference in how performance behaves as the size of the loop increases. However, both array sizes behaved the same in my tests.
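For reference, a sketch of the timing setup: the two P/Invoke declarations below are the standard kernel32 signatures for the QueryPerformanceCounter API, which is Windows-only; the `Stopwatch` fallback for other platforms and all the names are my additions.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class HighResTiming
{
    [DllImport("kernel32.dll")]
    static extern bool QueryPerformanceCounter(out long count);

    [DllImport("kernel32.dll")]
    static extern bool QueryPerformanceFrequency(out long frequency);

    static void Main()
    {
        var intArray = new int[1000000];
        long total = 0;

        if (Environment.OSVersion.Platform == PlatformID.Win32NT)
        {
            // Raw high-resolution counter, as used in the test described above.
            QueryPerformanceFrequency(out long freq);
            QueryPerformanceCounter(out long start);
            for (var j = 0; j != intArray.Length;) { total += intArray[j++]; }
            QueryPerformanceCounter(out long end);
            Console.WriteLine($"elapsed: {(end - start) * 1e3 / freq} ms, total={total}");
        }
        else
        {
            // Stopwatch wraps the same high-resolution counter where available.
            var sw = Stopwatch.StartNew();
            for (var j = 0; j != intArray.Length;) { total += intArray[j++]; }
            sw.Stop();
            Console.WriteLine($"elapsed: {sw.Elapsed.TotalMilliseconds} ms, total={total}");
        }
    }
}
```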

I have done a lot of performance optimization and one of the things that I have learned is that you can very easily optimize an application so that it runs faster on one particular computer while inadvertently causing it to run slower on another computer.

I am not talking hypothetically here. I have tweaked inner loops and poured hours and days of work into making a program run faster, only to have my hopes dashed because I was optimizing it on my workstation and the target computer was a different model of Intel processor.

So the moral of this story is:

  • Code snippet (2) runs faster than code snippet (3) on your computer but not on my computer

This is why some compilers have special optimization switches for different processors or some applications come in different versions even though one version could easily run on all supported hardware.

So if you are going to do testing like this, you have to do it same way that JIT compiler writers do: you have to perform your tests on a wide variety of hardware and then choose a blend, a happy-medium that gives the best performance on the most ubiquitous hardware.
