Why is Calli faster than a delegate call?

Posted 11-05 18:59 · 2879 words · 3 views · 0 comments

I was playing around with Reflection.Emit and found out about the little-used EmitCalli. Intrigued, I wondered whether it's any different from a regular method call, so I whipped up the code below:

using System;
using System.Diagnostics;
using System.Reflection.Emit;
using System.Runtime.InteropServices;
using System.Security;

[SuppressUnmanagedCodeSecurity]
static class Program
{
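    const long COUNT = 1 << 22;
    // Raw machine code for "return a * b", chosen by pointer size:
    //   x86: mov eax,[esp+4]; imul eax,[esp+8]; ret (plain ret, arguments are not popped)
    //   x64: imul ecx,edx; mov eax,ecx; ret
    static readonly byte[] multiply = IntPtr.Size == sizeof(int) ?
      new byte[] { 0x8B, 0x44, 0x24, 0x04, 0x0F, 0xAF, 0x44, 0x24, 0x08, 0xC3 }
    : new byte[] { 0x0f, 0xaf, 0xca, 0x8b, 0xc1, 0xc3 };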

    static void Main()
    {
        var handle = GCHandle.Alloc(multiply, GCHandleType.Pinned);
        try
        {
            // Make the native code executable (0x40 = PAGE_EXECUTE_READWRITE)
            uint old;
            if (!VirtualProtect(handle.AddrOfPinnedObject(),
                (IntPtr)multiply.Length, 0x40, out old))
                throw new System.ComponentModel.Win32Exception();
            var mulDelegate = (BinaryOp)Marshal.GetDelegateForFunctionPointer(
                handle.AddrOfPinnedObject(), typeof(BinaryOp));

            var T = typeof(uint); //To avoid redundant typing

            //Generate the method
            var method = new DynamicMethod("Mul", T,
                new Type[] { T, T }, T.Module);
            var gen = method.GetILGenerator();
            gen.Emit(OpCodes.Ldarg_0);
            gen.Emit(OpCodes.Ldarg_1);
            gen.Emit(OpCodes.Ldc_I8, (long)handle.AddrOfPinnedObject());
            gen.Emit(OpCodes.Conv_I);
            gen.EmitCalli(OpCodes.Calli, CallingConvention.StdCall,
                T, new Type[] { T, T });
            gen.Emit(OpCodes.Ret);

            var mulCalli = (BinaryOp)method.CreateDelegate(typeof(BinaryOp));

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < COUNT; i++) { mulDelegate(2, 3); }
            Console.WriteLine("Delegate: {0:N0}", sw.ElapsedMilliseconds);
            sw.Reset();

            sw.Start();
            for (int i = 0; i < COUNT; i++) { mulCalli(2, 3); }
            Console.WriteLine("Calli:    {0:N0}", sw.ElapsedMilliseconds);
        }
        finally { handle.Free(); }
    }

    delegate uint BinaryOp(uint a, uint b);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool VirtualProtect(
        IntPtr address, IntPtr size, uint protect, out uint oldProtect);
}

I ran the code in x86 mode and x64 mode. The results (elapsed milliseconds for 1 << 22 calls)?

32-bit:

  • Delegate version: 994
  • Calli version: 46

64-bit:

  • Delegate version: 326
  • Calli version: 83

I guess the question's obvious by now... why is there such a huge speed difference?


Update:

I created a 64-bit P/Invoke version as well:

  • Delegate version: 284
  • Calli version: 77
  • P/Invoke version: 31

Apparently, P/Invoke is faster... is this a problem with my benchmarking, or is there something going on I don't understand? (I'm in release mode, by the way.)
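
On the benchmarking question, a minimal sketch of a slightly more defensive timing loop may help separate measurement noise from real call overhead. It is not the code used for the numbers above: the names BenchSketch, BestOf and Sink are hypothetical, and a plain managed Multiply stands in for the native stub so the sketch runs anywhere. It warms the delegate up once, repeats the measurement, and keeps the fastest run.

```csharp
using System;
using System.Diagnostics;

public static class BenchSketch
{
    public delegate uint BinaryOp(uint a, uint b);

    // Accumulating results here keeps the calls from being optimized away.
    public static uint Sink;

    // Managed stand-in for the native multiply stub (assumption for this sketch).
    public static uint Multiply(uint a, uint b) { return a * b; }

    // Returns the fastest of `runs` timings, in milliseconds.
    public static double BestOf(BinaryOp op, int iterations, int runs)
    {
        op(2, 3); // warm-up call so one-time setup cost is not timed

        double best = double.MaxValue;
        for (int r = 0; r < runs; r++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++) { Sink += op(2, 3); }
            sw.Stop();
            best = Math.Min(best, sw.Elapsed.TotalMilliseconds);
        }
        return best;
    }

    public static void Main()
    {
        double ms = BestOf(Multiply, 1 << 22, 5);
        Console.WriteLine("Best of 5: {0:N2} ms", ms);
    }
}
```

Passing mulDelegate, mulCalli, or a P/Invoke wrapper to BestOf in place of Multiply would put all three variants through the same loop.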

Comments (2)

橘虞初梦 · 2024-11-12 18:59:19

Given your performance numbers, I assume you must be using the 2.0 framework, or something similar? The numbers are much better in 4.0, but the Marshal.GetDelegateForFunctionPointer version is still slower.

The thing is that not all delegates are created equal.

Delegates for managed code functions are essentially just a straight function call (on x86, that's a __fastcall), with the addition of a little "switcheroo" if you're calling a static function (but that's just 3 or 4 instructions on x86).
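
As a rough illustration of the two kinds of managed delegate mentioned above, here is a minimal sketch. The names ManagedDelegateSketch, Multiplier and MultiplyStatic are hypothetical. It only shows the observable difference (a delegate over a static method has no target object); the extra instructions for the static case happen inside the runtime and are not visible from C#.

```csharp
using System;

public static class ManagedDelegateSketch
{
    public delegate uint BinaryOp(uint a, uint b);

    public sealed class Multiplier
    {
        public uint Multiply(uint a, uint b) { return a * b; }
    }

    public static uint MultiplyStatic(uint a, uint b) { return a * b; }

    public static void Main()
    {
        // Delegate over a static method: there is no target object.
        BinaryOp viaStatic = new BinaryOp(MultiplyStatic);

        // Delegate over an instance method: the target is the Multiplier.
        BinaryOp viaInstance = new BinaryOp(new Multiplier().Multiply);

        Console.WriteLine(viaStatic.Target == null);   // True
        Console.WriteLine(viaInstance.Target != null); // True
        Console.WriteLine(viaStatic(2, 3) == viaInstance(2, 3)); // True
    }
}
```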

Delegates created by Marshal.GetDelegateForFunctionPointer, on the other hand, are a straight function call into a "stub" function, which adds some overhead (marshalling and whatnot) before calling the unmanaged function. In this case there's very little marshalling, and the marshalling for this call appears to be pretty much optimized out in 4.0 (but most likely still goes through the ML interpreter on 2.0). Even in 4.0, though, there's a stack walk demanding unmanaged code permissions that isn't part of your calli delegate.

I've generally found that, short of knowing someone on the .NET dev team, your best bet for figuring out what's going on with managed/unmanaged interop is to do a little digging with WinDbg and SOS.

你穿错了嫁妆 · 2024-11-12 18:59:19

Difficult to answer :)
Anyway I will try.

EmitCalli is faster because it emits a raw bytecode call. I suspect SuppressUnmanagedCodeSecurity also disables some checks, for instance stack overrun or array out-of-bounds index checks, so the code is not safe and runs at full speed.
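
To show what a "raw" calli looks like in isolation, here is a minimal sketch that uses the managed overload of EmitCalli against an ordinary static method rather than the native stub from the question. The names CalliSketch, Multiply and BuildCalliWrapper are hypothetical. The function pointer is pushed with ldftn and called indirectly, with no interop stub in between.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class CalliSketch
{
    public delegate uint BinaryOp(uint a, uint b);

    // Managed stand-in for the native multiply stub (assumption for this sketch).
    public static uint Multiply(uint a, uint b) { return a * b; }

    public static BinaryOp BuildCalliWrapper()
    {
        Type t = typeof(uint);
        MethodInfo target = typeof(CalliSketch).GetMethod("Multiply");

        var method = new DynamicMethod("MulViaCalli", t,
            new Type[] { t, t }, typeof(CalliSketch).Module);
        ILGenerator gen = method.GetILGenerator();
        gen.Emit(OpCodes.Ldarg_0);
        gen.Emit(OpCodes.Ldarg_1);
        gen.Emit(OpCodes.Ldftn, target);   // push the method's entry point
        // Indirect call through the pointer, using the managed calling convention.
        gen.EmitCalli(OpCodes.Calli, CallingConventions.Standard,
            t, new Type[] { t, t }, null);
        gen.Emit(OpCodes.Ret);

        return (BinaryOp)method.CreateDelegate(typeof(BinaryOp));
    }

    public static void Main()
    {
        BinaryOp mul = BuildCalliWrapper();
        Console.WriteLine(mul(2, 3)); // 6
    }
}
```

The question's version differs in that it loads a native address as a constant and uses the unmanaged overload of EmitCalli with CallingConvention.StdCall.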

The delegate version will have some compiled code to check types, and will also make a dereferenced call (because the delegate is like a typed function pointer).

My two cents!
