Understanding specific CIL/CLR optimizations
EDIT: I have added the ASM at the end.
I believe the best way to learn how to write good code on a platform is to experiment with the platform and thereby come to understand it. This question is therefore aimed at building a better understanding of the CLR; it is not an attempt at nano-optimization.
That said, it had occurred to me that fusing the two operations of setting and evaluating a variable would be faster. As it turns out, it is: in the code below, the second loop executes in about 60% of the time of the first.
private sealed class Temp
{
    public int val;
}

private void button13_Click(object sender, EventArgs e)
{
    Temp t = new Temp();
    Temp t1;
    int T1 = Environment.TickCount;
    for (int i = 0; i < 1000000000; i++)
    {
        t1 = t;
        if (t1.val++ == 1000)
        {
            t1.val = 0;
        }
    }
    int T2 = Environment.TickCount;
    for (int i = 0; i < 1000000000; i++)
    {
        if ((t1 = t).val++ == 1000)
        {
            t1.val = 0;
        }
    }
    int T3 = Environment.TickCount;
    MessageBox.Show((T2 - T1).ToString() + Environment.NewLine +
                    (T3 - T2).ToString() + Environment.NewLine +
                    t.val.ToString());
}
In most cases like this, the C#-to-CIL compiler duplicates the assigned value on the evaluation stack (a dup instruction), so the store and subsequent fetch that would otherwise be needed are avoided. That would account for the apparently significant speed increase.
However, the decompiled C# and the IL for this particular piece of code do not do this; if anything, they add overhead. Yet it is almost twice as fast.
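To make the comparison easier to repeat in isolation, the two source forms can be dropped into a pair of tiny methods like the sketch below (the nested class mirrors Temp above; the method names are mine) and inspected with ildasm or ILSpy. The exact IL depends on the compiler version and on Debug vs. Release, but the fused form is the one that tends to get the ldloc/dup/stloc pattern visible in the second loop of the IL further down.

// Sketch only: isolates the two source forms so their IL can be compared.
// The nested class mirrors the Temp class above; the method names are illustrative.
internal static class IlComparison
{
    internal sealed class Temp
    {
        public int val;
    }

    internal static int SeparateAssign(Temp t)
    {
        Temp t1 = t;            // typically stloc, then ldloc again before the field access
        return t1.val++;
    }

    internal static int FusedAssign(Temp t)
    {
        Temp t1;
        return (t1 = t).val++;  // dup keeps the reference on the evaluation stack,
                                // so the local does not have to be reloaded
    }
}

Keeping each form in its own method also keeps the loop counter and the timing code out of the IL, which makes the one-instruction difference much easier to spot.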
EDIT 2: I physically swapped the two loops and found that whichever loop runs second is always about twice as fast. Why? So I added a "warm-up" loop, and with it in place the first loop, too, ran about twice as fast as before. It is essentially the same code (ASM-wise). What is happening behind the scenes?
{
    Temp t1;
    Temp t = new Temp();
    int T1 = Environment.TickCount;
    for (int i = 0; i < 0x3b9aca00; i++)
    {
        t1 = t;
        if (t1.val++ == 0x3e8)
        {
            t1.val = 0;
        }
    }
    int T2 = Environment.TickCount;
    for (int i = 0; i < 0x3b9aca00; i++)
    {
        Temp temp1 = t1 = t;
        if (temp1.val++ == 0x3e8)
        {
            t1.val = 0;
        }
    }
    int T3 = Environment.TickCount;
    string[] CS$0$0002 = new string[] { (T2 - T1).ToString(), Environment.NewLine, (T3 - T2).ToString(), Environment.NewLine, t.val.ToString() };
    MessageBox.Show(string.Concat(CS$0$0002));
}
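The "warm-up" loop mentioned in EDIT 2 is not shown in the decompiled code above; the post does not give its exact form, but it amounts to running the same kind of loop once, untimed, before the first timestamp. A sketch (the iteration count is arbitrary):

// Warm-up pass (sketch): exercise the loop body once, untimed, so any
// one-time cost is paid before T1 is taken.
for (int i = 0; i < 1000000; i++)
{
    if (t.val++ == 1000)
    {
        t.val = 0;
    }
}
t.val = 0;   // reset the counter so the timed loops start from a known state

int T1 = Environment.TickCount;
// ... the two timed loops follow unchanged ...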
EDIT: compiled for 64-bit .NET 4 in Release mode. The IL:
L_0000: newobj instance void DIRECT_UI.Form1/Temp::.ctor()
L_0005: stloc.0
L_0006: call int32 [mscorlib]System.Environment::get_TickCount()
L_000b: stloc.2
L_000c: ldc.i4.0
L_000d: stloc.3
L_000e: br.s L_0037
L_0010: ldloc.0
L_0011: stloc.1
L_0012: ldloc.1
L_0013: dup
L_0014: ldfld int32 DIRECT_UI.Form1/Temp::val
L_0019: dup
L_001a: stloc.s CS$0$0000
L_001c: ldc.i4.1
L_001d: add
L_001e: stfld int32 DIRECT_UI.Form1/Temp::val
L_0023: ldloc.s CS$0$0000
L_0025: ldc.i4 0x3e8
L_002a: bne.un.s L_0033
L_002c: ldloc.1
L_002d: ldc.i4.0
L_002e: stfld int32 DIRECT_UI.Form1/Temp::val
L_0033: ldloc.3
L_0034: ldc.i4.1
L_0035: add
L_0036: stloc.3
L_0037: ldloc.3
L_0038: ldc.i4 0x3b9aca00
L_003d: blt.s L_0010
L_003f: call int32 [mscorlib]System.Environment::get_TickCount()
L_0044: stloc.s T2
L_0046: ldc.i4.0
L_0047: stloc.s V_5
L_0049: br.s L_0074
L_004b: ldloc.0
L_004c: dup
L_004d: stloc.1
L_004e: dup
L_004f: ldfld int32 DIRECT_UI.Form1/Temp::val
L_0054: dup
L_0055: stloc.s CS$0$0001
L_0057: ldc.i4.1
L_0058: add
L_0059: stfld int32 DIRECT_UI.Form1/Temp::val
L_005e: ldloc.s CS$0$0001
L_0060: ldc.i4 0x3e8
L_0065: bne.un.s L_006e
L_0067: ldloc.1
L_0068: ldc.i4.0
L_0069: stfld int32 DIRECT_UI.Form1/Temp::val
L_006e: ldloc.s V_5
L_0070: ldc.i4.1
L_0071: add
L_0072: stloc.s V_5
L_0074: ldloc.s V_5
L_0076: ldc.i4 0x3b9aca00
L_007b: blt.s L_004b
L_007d: call int32 [mscorlib]System.Environment::get_TickCount()
L_0082: stloc.s T3
L_0084: ldc.i4.5
L_0085: newarr string
L_008a: stloc.s CS$0$0002
L_008c: ldloc.s CS$0$0002
L_008e: ldc.i4.0
L_008f: ldloc.s T2
L_0091: ldloc.2
L_0092: sub
L_0093: stloc.s CS$0$0003
L_0095: ldloca.s CS$0$0003
L_0097: call instance string [mscorlib]System.Int32::ToString()
L_009c: stelem.ref
L_009d: ldloc.s CS$0$0002
L_009f: ldc.i4.1
L_00a0: call string [mscorlib]System.Environment::get_NewLine()
L_00a5: stelem.ref
L_00a6: ldloc.s CS$0$0002
L_00a8: ldc.i4.2
L_00a9: ldloc.s T3
L_00ab: ldloc.s T2
L_00ad: sub
L_00ae: stloc.s CS$0$0004
L_00b0: ldloca.s CS$0$0004
L_00b2: call instance string [mscorlib]System.Int32::ToString()
L_00b7: stelem.ref
L_00b8: ldloc.s CS$0$0002
L_00ba: ldc.i4.3
L_00bb: call string [mscorlib]System.Environment::get_NewLine()
L_00c0: stelem.ref
L_00c1: ldloc.s CS$0$0002
L_00c3: ldc.i4.4
L_00c4: ldloc.0
L_00c5: ldflda int32 DIRECT_UI.Form1/Temp::val
L_00ca: call instance string [mscorlib]System.Int32::ToString()
L_00cf: stelem.ref
L_00d0: ldloc.s CS$0$0002
L_00d2: call string [mscorlib]System.String::Concat(string[])
L_00d7: call valuetype [System.Windows.Forms]System.Windows.Forms.DialogResult [System.Windows.Forms]System.Windows.Forms.MessageBox::Show(string)
L_00dc: pop
L_00dd: ret
This doesn't make sense to me. It looks like a de-optimization, yet it runs faster. Can anyone shed some light on this?
ASM:
t1 = t;
000000ac mov rax,qword ptr [rsp+20h]
000000b1 mov qword ptr [rsp+28h],rax
if (t1.val++ == 1000)
000000b6 mov rax,qword ptr [rsp+28h]
000000bb mov eax,dword ptr [rax+8]
000000be mov dword ptr [rsp+74h],eax
000000c2 mov eax,dword ptr [rsp+74h]
000000c6 mov dword ptr [rsp+44h],eax
000000ca mov ecx,dword ptr [rsp+74h]
000000ce inc ecx
000000d0 mov rax,qword ptr [rsp+28h]
000000d5 mov dword ptr [rax+8],ecx
000000d8 cmp dword ptr [rsp+44h],3E8h
000000e0 jne 00000000000000EE
if ((t1 = t).val++ == 1000)
0000011d mov rax,qword ptr [rsp+20h]
00000122 mov qword ptr [rsp+28h],rax
00000127 mov rax,qword ptr [rsp+20h]
0000012c mov eax,dword ptr [rax+8]
0000012f mov dword ptr [rsp+7Ch],eax
00000133 mov eax,dword ptr [rsp+7Ch]
00000137 mov dword ptr [rsp+48h],eax
0000013b mov ecx,dword ptr [rsp+7Ch]
0000013f inc ecx
00000141 mov rax,qword ptr [rsp+20h]
00000146 mov dword ptr [rax+8],ecx
00000149 cmp dword ptr [rsp+48h],3E8h
00000151 jne 000000000000015F
Generated IL has only an indirect impact on code efficiency; what matters is the machine code the JIT compiler generates from it. In Tools + Options, Debugging, General, untick the "Suppress JIT optimization on module load" option. That enables the JIT optimizer even while you are debugging the program. Make sure you have the Release configuration selected.
Set a breakpoint on button13_Click, run, and click the button. When the breakpoint hits, right-click in the source code editor window and select "Go To Disassembly".
Note how both loops generate exactly the same machine code, for both the x86 and the x64 jitter. That is how it should be, of course: code that performs the same logical operation should produce the same machine code. All is well here.
This doesn't necessarily mean it will run at exactly the same speed, although it often does. Code alignment is critical.
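For completeness, a measurement set-up along the following lines (each loop in its own method, an untimed warm-up call, and Stopwatch rather than Environment.TickCount) makes the comparison less sensitive to which loop happens to run first. This is a sketch under those assumptions, shown as a console program for brevity, not the code from the question:

using System;
using System.Diagnostics;

internal static class LoopBenchmark
{
    private sealed class Temp
    {
        public int val;
    }

    private const int Iterations = 1000000000;

    private static void SeparateAssignLoop(Temp t)
    {
        Temp t1;
        for (int i = 0; i < Iterations; i++)
        {
            t1 = t;
            if (t1.val++ == 1000) { t1.val = 0; }
        }
    }

    private static void FusedAssignLoop(Temp t)
    {
        Temp t1;
        for (int i = 0; i < Iterations; i++)
        {
            if ((t1 = t).val++ == 1000) { t1.val = 0; }
        }
    }

    private static long Time(Action action)
    {
        // Stopwatch gives much finer resolution than Environment.TickCount.
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    private static void Main()
    {
        var t = new Temp();

        // Untimed warm-up: JIT-compile and execute both methods once first.
        SeparateAssignLoop(t);
        FusedAssignLoop(t);

        Console.WriteLine("separate: {0} ms", Time(() => SeparateAssignLoop(t)));
        Console.WriteLine("fused:    {0} ms", Time(() => FusedAssignLoop(t)));
    }
}

With both loops hoisted into their own methods and executed once before timing, any remaining difference is more likely to come from the generated machine code itself (alignment in particular) than from one-time start-up costs.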