C# performance optimization il postfix-notation

C# generated IL for ++ operator - when and why prefix/postfix notation is faster

Posted on 2024-11-08 13:32:29 · 2207 characters · 4 views · 0 comments

Since this question is about the increment operator and speed differences with prefix/postfix notation, I will describe the question very carefully lest Eric Lippert discover it and flame me!

(further info and more detail on why I am asking can be found at http://www.codeproject.com/KB/cs/FastLessCSharpIteration.aspx?msg=3899456#xx3899456xx/)

I have four snippets of code as follows:-

(1) Separate, Prefix:

    for (var j = 0; j != jmax;) { total += intArray[j]; ++j; }

(2) Separate, Postfix:

    for (var j = 0; j != jmax;) { total += intArray[j]; j++; }

(3) Indexer, Postfix:

    for (var j = 0; j != jmax;) { total += intArray[j++]; }

(4) Indexer, Prefix:

    for (var j = -1; j != last;) { total += intArray[++j]; } // last = jmax - 1

What I was trying to do was prove or disprove whether there is a performance difference between prefix and postfix notation in this context (i.e. a local variable, so not volatile, not changeable from another thread, etc.) and, if there was, why that would be.

Speed testing showed that:

(1) and (2) run at the same speed as each other.
(3) and (4) run at the same speed as each other.
(3)/(4) are ~27% slower than (1)/(2).
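
For anyone wanting to reproduce these timings, here is a minimal sketch of a harness. It is not the code used for the numbers above: `System.Diagnostics.Stopwatch` as the timer, the `long` accumulator, the repetition count, and the names `IncrementBench` and `Time` are all assumptions made for illustration. Results will vary with the JIT, the platform target, and the hardware.

```csharp
using System;
using System.Diagnostics;

static class IncrementBench
{
    // (1) Separate, prefix
    public static long SeparatePrefix(int[] intArray)
    {
        long total = 0;
        int jmax = intArray.Length;
        for (var j = 0; j != jmax;) { total += intArray[j]; ++j; }
        return total;
    }

    // (2) Separate, postfix
    public static long SeparatePostfix(int[] intArray)
    {
        long total = 0;
        int jmax = intArray.Length;
        for (var j = 0; j != jmax;) { total += intArray[j]; j++; }
        return total;
    }

    // (3) Indexer, postfix
    public static long IndexerPostfix(int[] intArray)
    {
        long total = 0;
        int jmax = intArray.Length;
        for (var j = 0; j != jmax;) { total += intArray[j++]; }
        return total;
    }

    // (4) Indexer, prefix
    public static long IndexerPrefix(int[] intArray)
    {
        long total = 0;
        int last = intArray.Length - 1;
        for (var j = -1; j != last;) { total += intArray[++j]; }
        return total;
    }

    // Hypothetical helper: calls 'body' once so it is JIT-compiled before
    // timing starts, then measures 'reps' further calls.
    public static TimeSpan Time(Func<int[], long> body, int[] data, int reps)
    {
        body(data);
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < reps; r++) body(data);
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        var data = new int[1000000];
        for (int i = 0; i < data.Length; i++) data[i] = i & 0xFF;

        Console.WriteLine("(1) {0} ms", Time(SeparatePrefix, data, 100).TotalMilliseconds);
        Console.WriteLine("(2) {0} ms", Time(SeparatePostfix, data, 100).TotalMilliseconds);
        Console.WriteLine("(3) {0} ms", Time(IndexerPostfix, data, 100).TotalMilliseconds);
        Console.WriteLine("(4) {0} ms", Time(IndexerPrefix, data, 100).TotalMilliseconds);
    }
}
```

Build in Release and run outside the debugger; otherwise the numbers mostly measure unoptimized code.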

Therefore I am concluding that there is no performance advantage of choosing prefix notation over postfix notation per se. However when the Result of the Operation is actually used, then this results in slower code than if it is simply thrown away.

I then had a look at the generated IL using Reflector and found the following:

The number of IL bytes is identical in all cases.
The .maxstack varied between 4 and 6 but I believe that is used only for verification purposes and so not relevant to performance.
(1) and (2) generated exactly the same IL, so it's no surprise that the timing was identical. So we can ignore (1).
(3) and (4) generated very similar code - the only relevant difference being the positioning of a dup opcode to account for the Result of the Operation. Again, no surprise about timing being identical.

So I then compared (2) and (3) to find out what could account for the difference in speed:

(2) uses a ldloc.0 op twice (once as part of the indexer and then later as part of the increment).
(3) uses ldloc.0 followed immediately by a dup op.

So the relevant IL for incrementing j in (1) (and (2)) is:

    // ldloc.0 already used once for the indexer operation higher up
    ldloc.0
    ldc.i4.1
    add
    stloc.0

(3) looks like this:

    ldloc.0
    dup // j on the stack for the *Result of the Operation*
    ldc.i4.1
    add
    stloc.0

(4) looks like this:

    ldloc.0
    ldc.i4.1
    add
    dup // j + 1 on the stack for the *Result of the Operation*
    stloc.0
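
Read back into C#, the two dup placements amount to the following. This is a sketch of the semantics only, not decompiler output; the temporaries `old` and `next` and the class and method names are hypothetical, and the JIT is free to keep such temporaries in registers.

```csharp
using System;

static class DupSemantics
{
    // (3) intArray[j++]: the dup comes before the add, so the OLD value of j
    // is what stays on the stack and is used as the index.
    public static long PostfixInIndexer(int[] intArray)
    {
        long total = 0;
        int jmax = intArray.Length;
        for (var j = 0; j != jmax;)
        {
            int old = j;            // ldloc.0; dup
            j = old + 1;            // ldc.i4.1; add; stloc.0
            total += intArray[old];
        }
        return total;
    }

    // (4) intArray[++j]: the dup comes after the add, so the NEW value of j
    // is what stays on the stack and is used as the index.
    public static long PrefixInIndexer(int[] intArray)
    {
        long total = 0;
        int last = intArray.Length - 1;
        for (var j = -1; j != last;)
        {
            int next = j + 1;       // ldloc.0; ldc.i4.1; add; dup
            j = next;               // stloc.0
            total += intArray[next];
        }
        return total;
    }

    public static void Main()
    {
        var data = new[] { 5, 7, 11 };
        Console.WriteLine(PostfixInIndexer(data)); // 23
        Console.WriteLine(PrefixInIndexer(data));  // 23
    }
}
```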

Now (finally!) to the question:

Is (2) faster because the JIT compiler recognises a pattern of ldloc.0/ldc.i4.1/add/stloc.0 as simply incrementing a local variable by 1 and optimizes it?
(and does the presence of a dup in (3) and (4) break that pattern, so the optimization is missed?)

And a supplementary:
If this is true then, for (3) at least, wouldn't replacing the dup with another ldloc.0 reintroduce that pattern?

卸妝后依然美 2024-11-15 13:32:29

OK, after much research (sad, I know!), I think I have answered my own question:

The answer is Maybe.
Apparently the JIT compilers do look for patterns (see http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx) to decide when and how array bounds checking can be optimized but whether it is the same pattern I was guessing at or not I don't know.

In this case, it is a moot point because the relative speed increase of (2) was due to something more than that. It turns out that the x64 JIT compiler is clever enough to work out whether an array length is constant (and seemingly also a multiple of the number of unrolls in a loop). So the code was only bounds checking at the end of each iteration, and each unroll became just:

        total += intArray[j]; j++;
00000081 8B 44 0B 10          mov         eax,dword ptr [rbx+rcx+10h] 
00000085 03 F0                add         esi,eax

I proved this by changing the app to let the array size be specified on the command line and seeing the different assembler output.
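
A minimal sketch of that change, assuming the element count arrives as the first command-line argument. `ParseSize`, its fallback value, and the class name are hypothetical. The point is only that the length is no longer a constant the JIT can see.

```csharp
using System;

static class RuntimeSizedArray
{
    // Hypothetical helper: use the first argument as the element count,
    // falling back to a default when it is missing or not a valid size.
    public static int ParseSize(string[] args, int fallback)
    {
        int n;
        return args.Length > 0 && int.TryParse(args[0], out n) && n >= 0 ? n : fallback;
    }

    public static void Main(string[] args)
    {
        int jmax = ParseSize(args, 1000000);
        var intArray = new int[jmax];
        for (int i = 0; i < jmax; i++) intArray[i] = 1;

        long total = 0;
        for (var j = 0; j != jmax;) { total += intArray[j]; j++; }
        Console.WriteLine(total);
    }
}
```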

Other things discovered during this exercise:

For a standalone increment operation (i.e. the result is not used), there is no difference in speed between prefix and postfix.
When an increment operation is used in an indexer, the assembler shows that prefix notation is slightly more efficient (and so close in the original case that I assumed it was just a timing discrepancy and called them equal - my mistake). The difference is more pronounced when compiled as x86.
Loop unrolling does work. Compared to a standard loop with array bounds optimization, 4 unrolls always gave an improvement of 10%-20% (and 34% in the x64/constant case). Increasing the number of unrolls gave varied timings, some very much slower in the case of a postfix in the indexer, so I'll stick with 4 if unrolling and only change that after extensive timing for a specific case.
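
For reference, unrolling by 4 might look like the sketch below. The remainder loop is an assumption added so that lengths that are not a multiple of 4 still work; `SumUnrolledBy4` and `limit` are hypothetical names, and whether this beats the plain loop needs timing on the target machine.

```csharp
using System;

static class Unrolled
{
    public static long SumUnrolledBy4(int[] intArray)
    {
        long total = 0;
        int jmax = intArray.Length;
        int limit = jmax - (jmax % 4);   // largest multiple of 4 not above jmax
        var j = 0;

        // Main body: four standalone postfix increments per iteration.
        while (j != limit)
        {
            total += intArray[j]; j++;
            total += intArray[j]; j++;
            total += intArray[j]; j++;
            total += intArray[j]; j++;
        }

        // Remainder: at most three leftover elements.
        while (j != jmax) { total += intArray[j]; j++; }
        return total;
    }

    public static void Main()
    {
        var data = new int[10];
        for (int i = 0; i < data.Length; i++) data[i] = i;
        Console.WriteLine(SumUnrolledBy4(data)); // 45
    }
}
```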


森林很绿却致人迷途 2024-11-15 13:32:29

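Interesting results. What I would do is:

Rewrite the application to do the whole test twice.
Put a message box between the two test runs.
Compile for release, with no optimizations turned off, and so on.
Start the executable outside of the debugger.
When the message box comes up, attach the debugger.
Now inspect the code generated for the two different cases by the jitter.

And then you'll know whether the jitter is doing a better job with one than the other. For example, the jitter might realize that in one case it can remove the array bounds checks but not realize that in the other case. I don't know; I'm not an expert on the jitter.

The reason for all this rigmarole is that the jitter may generate different code when the debugger is attached. If you want to know what it does under normal circumstances, then you have to make sure the code gets jitted under normal, non-debugger circumstances.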
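
The procedure above can be sketched as follows. `Console.ReadLine` stands in for the message box so the sketch needs no UI framework; that substitution, and the names `AttachLater`, `RunWholeTest`, and `RunTwice`, are assumptions for illustration.

```csharp
using System;

static class AttachLater
{
    // Hypothetical stand-in for the whole timing test.
    public static long RunWholeTest(int[] intArray)
    {
        long total = 0;
        int jmax = intArray.Length;
        for (var j = 0; j != jmax;) { total += intArray[j]; j++; }
        return total;
    }

    // Runs the test twice with 'pause' in between. The first run gets the
    // code jitted with no debugger attached; the second run is the one to
    // break into and inspect.
    public static long[] RunTwice(int[] data, Action pause)
    {
        long first = RunWholeTest(data);
        pause();
        long second = RunWholeTest(data);
        return new[] { first, second };
    }

    public static void Main()
    {
        var data = new int[1000];
        RunTwice(data, () =>
        {
            Console.WriteLine("Attach the debugger now, then press Enter.");
            Console.ReadLine();
        });
    }
}
```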


一人独醉 2024-11-15 13:32:29

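I like performance testing and I like fast programs, so I admire your question.

I tried to reproduce your findings and failed. On my Intel i7 x64 system, running your code samples on the .NET 4 framework in the x86|Release configuration, all four test cases produced roughly the same timings.

To do the test I created a brand new console application project and used the QueryPerformanceCounter API call to get a high-resolution, CPU-based timer. I tried two settings for jmax:

jmax = 1000
jmax = 1000000

because locality of the array often makes a big difference to performance as the size of the loop increases. However, both array sizes behaved the same in my tests.

I have done a lot of performance optimization, and one of the things I have learned is that you can very easily optimize an application so that it runs faster on one particular computer while inadvertently causing it to run slower on another.

I am not talking hypothetically here. I have tweaked inner loops and poured hours and days of work into making a program run faster, only to have my hopes dashed because I was optimizing it on my workstation and the target computer was a different model of Intel processor.

So the moral of this story is:

Code snippet (2) runs faster than code snippet (3) on your computer, but not on my computer.

This is why some compilers have special optimization switches for different processors, or why some applications come in different versions even though one version could easily run on all supported hardware.

So if you are going to do testing like this, you have to do it the same way JIT compiler writers do: you have to run your tests on a wide variety of hardware and then choose a blend, a happy medium that gives the best performance on the most ubiquitous hardware.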
