Understanding a specific CIL/CLR optimization

Posted on 2024-12-28 21:07:42

EDIT: I have added the ASM at the end.

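I believe the best way to learn how to write good code on a platform is to experiment with it and thereby come to understand it. This question therefore aims to build a better understanding of the CLR; it is not an attempt at nano-optimization.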

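Nevertheless, it occurred to me that fusing the two operations of setting and evaluating a variable would be faster. As it turns out, it is. In the code below, the second loop executes in about 60% of the time of the first loop: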

private sealed class Temp
{
    public int val;
}

private void button13_Click(object sender, EventArgs e)
{
    Temp t = new Temp();
    Temp t1;

    int T1 = Environment.TickCount;

    for (int i = 0; i < 1000000000; i++)
    {
        t1 = t;

        if (t1.val++ == 1000)
        {
            t1.val = 0;
        }
    }

    int T2 = Environment.TickCount;

    for (int i = 0; i < 1000000000; i++)
    {
        if ((t1 = t).val++ == 1000)
        {
            t1.val = 0;
        }
    }

    int T3 = Environment.TickCount;

    MessageBox.Show((T2 - T1).ToString() + Environment.NewLine + 
       (T3 - T2).ToString() + Environment.NewLine + 
       t.val.ToString());
}

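In most cases like this, the compiler generates CIL that duplicates the assigned value on the evaluation stack, which means the store and fetch that would otherwise be needed are avoided. This would account for the apparently significant speed increase.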

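However, the decompiled C# and the IL for this particular piece of code do not show that; they show added overhead instead. Yet the loop is almost twice as fast.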

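EDIT2: I physically swapped the two loops and discovered that the loop that runs second is always about twice as fast. Why? So I added a "warm-up" loop, which made the first loop about twice as fast. It's basically the same code (ASM-wise). What is happening behind the scenes?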

{
    Temp t1;
    Temp t = new Temp();
    int T1 = Environment.TickCount;
    for (int i = 0; i < 0x3b9aca00; i++)
    {
        t1 = t;
        if (t1.val++ == 0x3e8)
        {
            t1.val = 0;
        }
    }
    int T2 = Environment.TickCount;
    for (int i = 0; i < 0x3b9aca00; i++)
    {
        Temp temp1 = t1 = t;
        if (temp1.val++ == 0x3e8)
        {
            t1.val = 0;
        }
    }
    int T3 = Environment.TickCount;
    string[] CS$0$0002 = new string[] { (T2 - T1).ToString(), Environment.NewLine, (T3 - T2).ToString(), Environment.NewLine, t.val.ToString() };
    MessageBox.Show(string.Concat(CS$0$0002));
}

EDIT: Compiled as 64-bit, .NET 4, Release mode.

L_0000: newobj instance void DIRECT_UI.Form1/Temp::.ctor()
L_0005: stloc.0 
L_0006: call int32 [mscorlib]System.Environment::get_TickCount()
L_000b: stloc.2 
L_000c: ldc.i4.0 
L_000d: stloc.3 
L_000e: br.s L_0037
L_0010: ldloc.0 
L_0011: stloc.1 
L_0012: ldloc.1 
L_0013: dup 
L_0014: ldfld int32 DIRECT_UI.Form1/Temp::val
L_0019: dup 
L_001a: stloc.s CS$0$0000
L_001c: ldc.i4.1 
L_001d: add 
L_001e: stfld int32 DIRECT_UI.Form1/Temp::val
L_0023: ldloc.s CS$0$0000
L_0025: ldc.i4 0x3e8
L_002a: bne.un.s L_0033
L_002c: ldloc.1 
L_002d: ldc.i4.0 
L_002e: stfld int32 DIRECT_UI.Form1/Temp::val
L_0033: ldloc.3 
L_0034: ldc.i4.1 
L_0035: add 
L_0036: stloc.3 
L_0037: ldloc.3 
L_0038: ldc.i4 0x3b9aca00
L_003d: blt.s L_0010
L_003f: call int32 [mscorlib]System.Environment::get_TickCount()
L_0044: stloc.s T2
L_0046: ldc.i4.0 
L_0047: stloc.s V_5
L_0049: br.s L_0074
L_004b: ldloc.0 
L_004c: dup 
L_004d: stloc.1 
L_004e: dup 
L_004f: ldfld int32 DIRECT_UI.Form1/Temp::val
L_0054: dup 
L_0055: stloc.s CS$0$0001
L_0057: ldc.i4.1 
L_0058: add 
L_0059: stfld int32 DIRECT_UI.Form1/Temp::val
L_005e: ldloc.s CS$0$0001
L_0060: ldc.i4 0x3e8
L_0065: bne.un.s L_006e
L_0067: ldloc.1 
L_0068: ldc.i4.0 
L_0069: stfld int32 DIRECT_UI.Form1/Temp::val
L_006e: ldloc.s V_5
L_0070: ldc.i4.1 
L_0071: add 
L_0072: stloc.s V_5
L_0074: ldloc.s V_5
L_0076: ldc.i4 0x3b9aca00
L_007b: blt.s L_004b
L_007d: call int32 [mscorlib]System.Environment::get_TickCount()
L_0082: stloc.s T3
L_0084: ldc.i4.5 
L_0085: newarr string
L_008a: stloc.s CS$0$0002
L_008c: ldloc.s CS$0$0002
L_008e: ldc.i4.0 
L_008f: ldloc.s T2
L_0091: ldloc.2 
L_0092: sub 
L_0093: stloc.s CS$0$0003
L_0095: ldloca.s CS$0$0003
L_0097: call instance string [mscorlib]System.Int32::ToString()
L_009c: stelem.ref 
L_009d: ldloc.s CS$0$0002
L_009f: ldc.i4.1 
L_00a0: call string [mscorlib]System.Environment::get_NewLine()
L_00a5: stelem.ref 
L_00a6: ldloc.s CS$0$0002
L_00a8: ldc.i4.2 
L_00a9: ldloc.s T3
L_00ab: ldloc.s T2
L_00ad: sub 
L_00ae: stloc.s CS$0$0004
L_00b0: ldloca.s CS$0$0004
L_00b2: call instance string [mscorlib]System.Int32::ToString()
L_00b7: stelem.ref 
L_00b8: ldloc.s CS$0$0002
L_00ba: ldc.i4.3 
L_00bb: call string [mscorlib]System.Environment::get_NewLine()
L_00c0: stelem.ref 
L_00c1: ldloc.s CS$0$0002
L_00c3: ldc.i4.4 
L_00c4: ldloc.0 
L_00c5: ldflda int32 DIRECT_UI.Form1/Temp::val
L_00ca: call instance string [mscorlib]System.Int32::ToString()
L_00cf: stelem.ref 
L_00d0: ldloc.s CS$0$0002
L_00d2: call string [mscorlib]System.String::Concat(string[])
L_00d7: call valuetype [System.Windows.Forms]System.Windows.Forms.DialogResult [System.Windows.Forms]System.Windows.Forms.MessageBox::Show(string)
L_00dc: pop 
L_00dd: ret 

This doesn't make sense to me. It looks like a reverse optimization, yet it runs faster. Can anyone shed some light on this?

ASM:

                t1 = t;
000000ac  mov         rax,qword ptr [rsp+20h] 
000000b1  mov         qword ptr [rsp+28h],rax 

                if (t1.val++ == 1000)
000000b6  mov         rax,qword ptr [rsp+28h] 
000000bb  mov         eax,dword ptr [rax+8] 
000000be  mov         dword ptr [rsp+74h],eax 
000000c2  mov         eax,dword ptr [rsp+74h] 
000000c6  mov         dword ptr [rsp+44h],eax 
000000ca  mov         ecx,dword ptr [rsp+74h] 
000000ce  inc         ecx 
000000d0  mov         rax,qword ptr [rsp+28h] 
000000d5  mov         dword ptr [rax+8],ecx 
000000d8  cmp         dword ptr [rsp+44h],3E8h 
000000e0  jne         00000000000000EE
                if ((t1 = t).val++ == 1000)
0000011d  mov         rax,qword ptr [rsp+20h] 
00000122  mov         qword ptr [rsp+28h],rax 
00000127  mov         rax,qword ptr [rsp+20h] 
0000012c  mov         eax,dword ptr [rax+8] 
0000012f  mov         dword ptr [rsp+7Ch],eax 
00000133  mov         eax,dword ptr [rsp+7Ch] 
00000137  mov         dword ptr [rsp+48h],eax 
0000013b  mov         ecx,dword ptr [rsp+7Ch] 
0000013f  inc         ecx 
00000141  mov         rax,qword ptr [rsp+20h] 
00000146  mov         dword ptr [rax+8],ecx 
00000149  cmp         dword ptr [rsp+48h],3E8h 
00000151  jne         000000000000015F

Comments (1)

微凉 2025-01-04 21:07:42

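The generated IL has only an indirect impact on code efficiency. In Visual Studio, go to Tools + Options, Debugging, General, and untick the "Suppress JIT optimization on module load" option. This enables the JIT optimizer even while you debug the program. Make sure you have the Release configuration selected.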

Set a breakpoint on button13_Click. Run and click the button. Right-click in the source code editor window and select "Go To Disassembly".
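Before comparing disassembly, it can help to confirm that the build itself asks for optimized code. The following is only a minimal sketch: the JitCheck class and its method names are made up for illustration, and it inspects just the DebuggableAttribute that the compiler stamps on the assembly. It cannot see the Visual Studio "Suppress JIT optimization on module load" setting, which still has to be checked by hand.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

// Hypothetical helper, not part of the original question or answer.
static class JitCheck
{
    // True when the assembly carries a DebuggableAttribute that tells the
    // JIT to disable optimizations (typical of a Debug configuration).
    public static bool IsJitOptimizerDisabled(Assembly assembly)
    {
        var attr = (DebuggableAttribute)Attribute.GetCustomAttribute(
            assembly, typeof(DebuggableAttribute));
        return attr != null && attr.IsJITOptimizerDisabled;
    }

    // One-line summary of the build flag and whether a debugger is attached.
    public static string Describe(Assembly assembly)
    {
        return string.Format(
            "JIT optimizer disabled by build: {0}; debugger attached: {1}",
            IsJitOptimizerDisabled(assembly), Debugger.IsAttached);
    }
}
```

Calling JitCheck.Describe(Assembly.GetExecutingAssembly()) from the click handler and showing the result would be one way to use it. If it reports that the optimizer is disabled, the disassembly will probably contain the same kind of redundant stack loads and stores seen in the ASM listing in the question.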

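Note how both loops generate exactly the same machine code, with both the x86 and the x64 jitter. That is how it should be, of course: code that performs the same logical operation should produce the same machine code. All is well here.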

This doesn't necessarily mean it will run at exactly the same speed, although it often does. Code alignment is critical.
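One way to make such a comparison less sensitive to loop order is sketched below. It is an assumption-laden illustration rather than the method used in the question: LoopBenchmark, Separate, Fused, TimeMs and Run are hypothetical names, Stopwatch is swapped in for Environment.TickCount, and each loop is moved into its own non-inlined method. Moving the loops also changes where the code lands in memory, so alignment effects may differ from those in the original button13_Click handler.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

sealed class Temp
{
    public int val;
}

// Hypothetical harness; names are illustrative only.
static class LoopBenchmark
{
    // Same body as the first loop in the question.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Separate(Temp t, int iterations)
    {
        Temp t1;
        for (int i = 0; i < iterations; i++)
        {
            t1 = t;
            if (t1.val++ == 1000)
            {
                t1.val = 0;
            }
        }
    }

    // Same body as the second loop in the question.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Fused(Temp t, int iterations)
    {
        Temp t1;
        for (int i = 0; i < iterations; i++)
        {
            if ((t1 = t).val++ == 1000)
            {
                t1.val = 0;
            }
        }
    }

    // Times a single call of the given loop in milliseconds.
    public static long TimeMs(Action<Temp, int> loop, Temp t, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        loop(t, iterations);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Warms up both methods, then alternates them for several rounds.
    public static void Run(int iterations, int rounds)
    {
        Temp t = new Temp();
        Separate(t, 10000); // warm-up: forces both methods to be jitted
        Fused(t, 10000);    // before any timed call
        for (int r = 0; r < rounds; r++)
        {
            long a = TimeMs(Separate, t, iterations);
            long b = TimeMs(Fused, t, iterations);
            Console.WriteLine("round {0}: separate {1} ms, fused {2} ms", r, a, b);
        }
    }
}
```

A call such as LoopBenchmark.Run(1000000000, 5) mirrors the iteration count from the question. If the two columns stay close across rounds, that would be consistent with the answer's point that the loops compile to the same machine code and that the remaining difference comes from something other than the source-level change.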
