当前位置：文江博客话题详情

C# performance expression-trees dynamically-generated

编译为委托表达式的性能

发布于 2024-10-17 23:58:26 字数 6796 浏览 7 评论 0 原文

我正在生成一个表达式树，它将属性从源对象映射到目标对象，然后将其编译为 Func 并执行。

这是生成的 LambdaExpression 的调试视图：

.Lambda #Lambda1<System.Func`3[MemberMapper.Benchmarks.Program+ComplexSourceType,MemberMapper.Benchmarks.Program+ComplexDestinationType,MemberMapper.Benchmarks.Program+ComplexDestinationType]>(
    MemberMapper.Benchmarks.Program+ComplexSourceType $right,
    MemberMapper.Benchmarks.Program+ComplexDestinationType $left) {
    .Block(
        MemberMapper.Benchmarks.Program+NestedSourceType $Complex$955332131,
        MemberMapper.Benchmarks.Program+NestedDestinationType $Complex$2105709326) {
        $left.ID = $right.ID;
        $Complex$955332131 = $right.Complex;
        $Complex$2105709326 = .New MemberMapper.Benchmarks.Program+NestedDestinationType();
        $Complex$2105709326.ID = $Complex$955332131.ID;
        $Complex$2105709326.Name = $Complex$955332131.Name;
        $left.Complex = $Complex$2105709326;
        $left
    }
}

清理后将是：

(left, right) =>
{
    left.ID = right.ID;
    var complexSource = right.Complex;
    var complexDestination = new NestedDestinationType();
    complexDestination.ID = complexSource.ID;
    complexDestination.Name = complexSource.Name;
    left.Complex = complexDestination;
    return left;
}

这是将属性映射到这些类型的代码：

public class NestedSourceType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexSourceType
{
  public int ID { get; set; }
  public NestedSourceType Complex { get; set; }
}

public class NestedDestinationType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexDestinationType
{
  public int ID { get; set; }
  public NestedDestinationType Complex { get; set; }
}

执行此操作的手动代码是：

var destination = new ComplexDestinationType
{
  ID = source.ID,
  Complex = new NestedDestinationType
  {
    ID = source.Complex.ID,
    Name = source.Complex.Name
  }
};

问题是当我编译LambdaExpression并对生成的delegate进行基准测试，它比手动版本慢大约 10 倍。我不知道为什么会这样。整个想法是最大程度地提高性能，而无需繁琐的手动映射。

当我从 Bart de Smet 的关于此主题的博客文章，并对计算素数的手动版本与编译的表达式树进行了基准测试，它们的性能完全相同。

当 LambdaExpression 的调试视图看起来像您所期望的那样时，什么会导致如此巨大的差异？

编辑

根据要求，我添加了我使用的基准：

public static ComplexDestinationType Foo;

static void Benchmark()
{

  var mapper = new DefaultMemberMapper();

  var map = mapper.CreateMap(typeof(ComplexSourceType),
                             typeof(ComplexDestinationType)).FinalizeMap();

  var source = new ComplexSourceType
  {
    ID = 5,
    Complex = new NestedSourceType
    {
      ID = 10,
      Name = "test"
    }
  };

  var sw = Stopwatch.StartNew();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = new ComplexDestinationType
    {
      ID = source.ID + i,
      Complex = new NestedDestinationType
      {
        ID = source.Complex.ID + i,
        Name = source.Complex.Name
      }
    };
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = mapper.Map<ComplexSourceType, ComplexDestinationType>(source);
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  var func = (Func<ComplexSourceType, ComplexDestinationType, ComplexDestinationType>)
             map.MappingFunction;

  var destination = new ComplexDestinationType();

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = func(source, new ComplexDestinationType());
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);
}

第二个比手动执行要慢，这是可以理解的，因为它涉及字典查找和一些对象实例化，但第三个应该和它一样快正在调用的原始委托以及从 Delegate 到 Func 的转换发生在循环之外。

我也尝试将手动代码包装在函数中，但我记得它并没有产生明显的区别。无论哪种方式，函数调用都不应该增加一个数量级的开销。

我还进行了两次基准测试，以确保 JIT 不会干扰。

编辑

您可以在这里获取该项目的代码：

https://github.com/ JulianR/MemberMapper/

我使用了 Sons-of-Strike 调试器扩展（如 Bart de Smet 的博客文章中所述）来转储动态方法生成的 IL：

IL_0000: ldarg.2 
IL_0001: ldarg.1 
IL_0002: callvirt 6000003 ComplexSourceType.get_ID()
IL_0007: callvirt 6000004 ComplexDestinationType.set_ID(Int32)
IL_000c: ldarg.1 
IL_000d: callvirt 6000005 ComplexSourceType.get_Complex()
IL_0012: brfalse IL_0043
IL_0017: ldarg.1 
IL_0018: callvirt 6000006 ComplexSourceType.get_Complex()
IL_001d: stloc.0 
IL_001e: newobj 6000007 NestedDestinationType..ctor()
IL_0023: stloc.1 
IL_0024: ldloc.1 
IL_0025: ldloc.0 
IL_0026: callvirt 6000008 NestedSourceType.get_ID()
IL_002b: callvirt 6000009 NestedDestinationType.set_ID(Int32)
IL_0030: ldloc.1 
IL_0031: ldloc.0 
IL_0032: callvirt 600000a NestedSourceType.get_Name()
IL_0037: callvirt 600000b NestedDestinationType.set_Name(System.String)
IL_003c: ldarg.2 
IL_003d: ldloc.1 
IL_003e: callvirt 600000c ComplexDestinationType.set_Complex(NestedDestinationType)
IL_0043: ldarg.2 
IL_0044: ret

我不是 IL 专家，但这看起来很漂亮简单明了并且正是您所期望的，不是吗？那为什么这么慢呢？没有奇怪的拳击操作，没有隐藏的实例，什么都没有。它与上面的表达式树不完全相同，因为现在对 right.Complex 也有一个 null 检查。

这是手动版本的代码（通过 Reflector 获得）：

L_0000: ldarg.1 
L_0001: ldarg.0 
L_0002: callvirt instance int32 ComplexSourceType::get_ID()
L_0007: callvirt instance void ComplexDestinationType::set_ID(int32)
L_000c: ldarg.0 
L_000d: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_0012: brfalse.s L_0040
L_0014: ldarg.0 
L_0015: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_001a: stloc.0 
L_001b: newobj instance void NestedDestinationType::.ctor()
L_0020: stloc.1 
L_0021: ldloc.1 
L_0022: ldloc.0 
L_0023: callvirt instance int32 NestedSourceType::get_ID()
L_0028: callvirt instance void NestedDestinationType::set_ID(int32)
L_002d: ldloc.1 
L_002e: ldloc.0 
L_002f: callvirt instance string NestedSourceType::get_Name()
L_0034: callvirt instance void NestedDestinationType::set_Name(string)
L_0039: ldarg.1 
L_003a: ldloc.1 
L_003b: callvirt instance void ComplexDestinationType::set_Complex(class NestedDestinationType)
L_0040: ldarg.1 
L_0041: ret

看起来与我相同。

编辑

我点击了 Michael B 关于此主题的回答中的链接。我尝试在接受的答案中实现这个技巧，它成功了！如果您想了解该技巧的摘要：它会创建一个动态程序集，并将表达式树编译为该程序集中的静态方法，并且由于某种原因，速度快了 10 倍。这样做的一个缺点是，我的基准类是内部的（实际上，公共类嵌套在内部类中），当我尝试访问它们时，它会抛出异常，因为它们不可访问。似乎没有解决方法，但我可以简单地检测引用的类型是否是内部的，并决定使用哪种编译方法。

但仍然让我烦恼的是为什么素数方法在性能上与编译的表达式树相同。

再次，我欢迎任何人在 GitHub 存储库中运行代码来确认我的测量结果并确保我没有疯:)

原文

I'm generating an expression tree that maps properties from a source object to a destination object, that is then compiled to a Func<TSource, TDestination, TDestination> and executed.

This is the debug view of the resulting LambdaExpression:

.Lambda #Lambda1<System.Func`3[MemberMapper.Benchmarks.Program+ComplexSourceType,MemberMapper.Benchmarks.Program+ComplexDestinationType,MemberMapper.Benchmarks.Program+ComplexDestinationType]>(
    MemberMapper.Benchmarks.Program+ComplexSourceType $right,
    MemberMapper.Benchmarks.Program+ComplexDestinationType $left) {
    .Block(
        MemberMapper.Benchmarks.Program+NestedSourceType $Complex$955332131,
        MemberMapper.Benchmarks.Program+NestedDestinationType $Complex$2105709326) {
        $left.ID = $right.ID;
        $Complex$955332131 = $right.Complex;
        $Complex$2105709326 = .New MemberMapper.Benchmarks.Program+NestedDestinationType();
        $Complex$2105709326.ID = $Complex$955332131.ID;
        $Complex$2105709326.Name = $Complex$955332131.Name;
        $left.Complex = $Complex$2105709326;
        $left
    }
}

Cleaned up it would be:

(left, right) =>
{
    left.ID = right.ID;
    var complexSource = right.Complex;
    var complexDestination = new NestedDestinationType();
    complexDestination.ID = complexSource.ID;
    complexDestination.Name = complexSource.Name;
    left.Complex = complexDestination;
    return left;
}

That's the code that maps the properties on these types:

public class NestedSourceType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexSourceType
{
  public int ID { get; set; }
  public NestedSourceType Complex { get; set; }
}

public class NestedDestinationType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexDestinationType
{
  public int ID { get; set; }
  public NestedDestinationType Complex { get; set; }
}

The manual code to do this is:

var destination = new ComplexDestinationType
{
  ID = source.ID,
  Complex = new NestedDestinationType
  {
    ID = source.Complex.ID,
    Name = source.Complex.Name
  }
};

The problem is that when I compile the LambdaExpression and benchmark the resulting delegate it is about 10x slower than the manual version. I have no idea why that is. And the whole idea about this is maximum performance without the tedium of manual mapping.

When I take code by Bart de Smet from his blog post on this topic and benchmark the manual version of calculating prime numbers versus the compiled expression tree, they are completely identical in performance.

What can cause this huge difference when the debug view of the LambdaExpression looks like what you would expect?

EDIT

As requested I added the benchmark I used:

public static ComplexDestinationType Foo;

static void Benchmark()
{

  var mapper = new DefaultMemberMapper();

  var map = mapper.CreateMap(typeof(ComplexSourceType),
                             typeof(ComplexDestinationType)).FinalizeMap();

  var source = new ComplexSourceType
  {
    ID = 5,
    Complex = new NestedSourceType
    {
      ID = 10,
      Name = "test"
    }
  };

  var sw = Stopwatch.StartNew();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = new ComplexDestinationType
    {
      ID = source.ID + i,
      Complex = new NestedDestinationType
      {
        ID = source.Complex.ID + i,
        Name = source.Complex.Name
      }
    };
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = mapper.Map<ComplexSourceType, ComplexDestinationType>(source);
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  var func = (Func<ComplexSourceType, ComplexDestinationType, ComplexDestinationType>)
             map.MappingFunction;

  var destination = new ComplexDestinationType();

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = func(source, new ComplexDestinationType());
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);
}

The second one is understandably slower than doing it manually as it involves a dictionary lookup and a few object instantiations, but the third one should be just as fast as it's the raw delegate there that's being invoked and the cast from Delegate to Func happens outside the loop.

I tried wrapping the manual code in a function as well, but I recall that it didn't make a noticeable difference. Either way, a function call shouldn't add an order of magnitude of overhead.

I also do the benchmark twice to make sure the JIT isn't interfering.

EDIT

You can get the code for this project here:

https://github.com/JulianR/MemberMapper/

I used the Sons-of-Strike debugger extension as described in that blog post by Bart de Smet to dump the generated IL of the dynamic method:

IL_0000: ldarg.2 
IL_0001: ldarg.1 
IL_0002: callvirt 6000003 ComplexSourceType.get_ID()
IL_0007: callvirt 6000004 ComplexDestinationType.set_ID(Int32)
IL_000c: ldarg.1 
IL_000d: callvirt 6000005 ComplexSourceType.get_Complex()
IL_0012: brfalse IL_0043
IL_0017: ldarg.1 
IL_0018: callvirt 6000006 ComplexSourceType.get_Complex()
IL_001d: stloc.0 
IL_001e: newobj 6000007 NestedDestinationType..ctor()
IL_0023: stloc.1 
IL_0024: ldloc.1 
IL_0025: ldloc.0 
IL_0026: callvirt 6000008 NestedSourceType.get_ID()
IL_002b: callvirt 6000009 NestedDestinationType.set_ID(Int32)
IL_0030: ldloc.1 
IL_0031: ldloc.0 
IL_0032: callvirt 600000a NestedSourceType.get_Name()
IL_0037: callvirt 600000b NestedDestinationType.set_Name(System.String)
IL_003c: ldarg.2 
IL_003d: ldloc.1 
IL_003e: callvirt 600000c ComplexDestinationType.set_Complex(NestedDestinationType)
IL_0043: ldarg.2 
IL_0044: ret

I'm no expert at IL, but this seems pretty straightfoward and exactly what you would expect, no? Then why is it so slow? No weird boxing operations, no hidden instantiations, nothing. It's not exactly the same as expression tree above as there's also a null check on right.Complex now.

This is the code for the manual version (obtained through Reflector):

L_0000: ldarg.1 
L_0001: ldarg.0 
L_0002: callvirt instance int32 ComplexSourceType::get_ID()
L_0007: callvirt instance void ComplexDestinationType::set_ID(int32)
L_000c: ldarg.0 
L_000d: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_0012: brfalse.s L_0040
L_0014: ldarg.0 
L_0015: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_001a: stloc.0 
L_001b: newobj instance void NestedDestinationType::.ctor()
L_0020: stloc.1 
L_0021: ldloc.1 
L_0022: ldloc.0 
L_0023: callvirt instance int32 NestedSourceType::get_ID()
L_0028: callvirt instance void NestedDestinationType::set_ID(int32)
L_002d: ldloc.1 
L_002e: ldloc.0 
L_002f: callvirt instance string NestedSourceType::get_Name()
L_0034: callvirt instance void NestedDestinationType::set_Name(string)
L_0039: ldarg.1 
L_003a: ldloc.1 
L_003b: callvirt instance void ComplexDestinationType::set_Complex(class NestedDestinationType)
L_0040: ldarg.1 
L_0041: ret

Looks identical to me..

EDIT

I followed the link in Michael B's answer about this topic. I tried implementing the trick in the accepted answer and it worked! If you want a summary of the trick: it creates a dynamic assembly and compiles the expression tree into a static method in that assembly and for some reason that's 10x faster. A downside to this is that my benchmark classes were internal (actually, public classes nested in an internal one) and it threw an exception when I tried to access them because they weren't accessible. There doesn't seem to be a workaround that, but I can simply detect if the types referenced are internal or not and decide which approach to compilation to use.

What still bugs me though is why that prime numbers method is identical in performance to the compiled expression tree.

And again, I welcome anyone to run the code at that GitHub repository to confirm my measurements and to make sure I'm not crazy :)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

江心雾 2024-10-24 23:58:26

对于如此巨大的窃听事件来说，这很奇怪。有一些事情需要考虑。首先，VS 编译的代码应用了不同的属性，这些属性可能会影响抖动以进行不同的优化。

您是否在这些结果中包括编译委托的第一次执行？您不应该，您应该忽略任一代码路径的第一次执行。您还应该将普通代码转换为委托，因为委托调用比调用实例方法稍慢，而实例方法又比调用静态方法慢。

至于其他更改，需要考虑以下事实：已编译的委托有一个闭包对象，该对象未在此处使用，但这意味着这是一个目标委托，其执行速度可能会慢一些。您会注意到编译的委托有一个目标对象，并且所有参数都向下移动一位。

此外，由 lcg 生成的方法被认为是静态的，由于寄存器切换业务，当编译为委托时，它们往往比实例方法慢。（Duffy 说“this”指针在 CLR 中有一个保留寄存器，当您有一个静态委托时，它必须转移到另一个寄存器，从而调用一点开销）。
最后，运行时生成的代码似乎比 VS 生成的代码运行得稍慢。在运行时生成的代码似乎有额外的沙箱，并且是从不同的程序集启动的（如果您不相信我，请尝试使用诸如 ldftn 操作码或 calli 操作码之类的东西，那些 Reflection.emited 委托会编译，但不会让您实际执行它们）这会产生最小的开销。

另外，您正在发布模式下运行吗？
我们在这里查看了一个类似的主题：
为什么 Func<>从 Expression> 创建比 Func<> 慢直接声明？

编辑：
另请参阅我的回答：
DynamicMethod 比编译的 IL 函数慢得多

主要要点是，您应该将以下代码添加到计划创建和调用运行时生成的代码的程序集中。

[assembly: AllowPartiallyTrustedCallers]
[assembly: SecurityTransparent]
[assembly: SecurityRules(SecurityRuleSet.Level2,SkipVerificationInFullTrust=true)]

并始终使用内置委托类型或来自具有这些标志的程序集的委托类型。

原因是匿名动态代码托管在始终标记为部分信任的程序集中。通过允许部分信任的呼叫者，您可以跳过部分握手过程。透明度意味着您的代码不会提高安全级别（即行为缓慢），最后真正的技巧是调用标记为跳过验证的程序集中托管的委托类型。 Func#Invoke 是完全可信的，因此不需要验证。这将为您提供 VS 编译器生成的代码的性能。如果不使用这些属性，您会看到 .NET 4 中的开销。您可能认为 SecurityRuleSet.Level1 是避免这种开销的好方法，但切换安全模型的成本也很高。

简而言之，添加这些属性，然后您的微循环性能测试将运行大致相同。

This is pretty strange for such a huge overheard. There are a few things to take into account. First the VS compiled code has different properties applied to it that might influence the jitter to optimize differently.

Are you including the first execution for the compiled delegate in these results? You shouldn't, you should ignore the first execution of either code path. You should also turn the normal code into a delegate as delegate invocation is slightly slower than invoking an instance method, which is slower than invoking a static method.

As for other changes there is something to account for the fact that the compiled delegate has a closure object which isn't being used here but means that this is a targeted delegate which might perform a bit slower. You'll notice the compiled delegate has a target object and all the arguments are shifted down by one.

Also methods generated by lcg are considered static which tend to be slower when compiled to delegates than instance methods because of register switching business. (Duffy said that the "this" pointer has a reserved register in CLR and when you have a delegate for a static it has to be shifted to a different register invoking a slight overhead).
Finally, code generated at runtime seems to run slightly slower than code generated by VS. Code generated at runtime seems to have extra sandboxing and is launched from a different assembly (try using something like ldftn opcode or calli opcode if you don't believe me, those reflection.emited delegates will compile but won't let you actually execute them) which invokes a minimal overhead.

Also you are running in release mode right?
There was a similar topic where we looked over this problem here:
Why is Func<> created from Expression<Func<>> slower than Func<> declared directly?

Edit:
Also see my answer here:
DynamicMethod is much slower than compiled IL function

The main takeaway is that you should add the following code to the assembly where you plan to create and invoke run-time generated code.

[assembly: AllowPartiallyTrustedCallers]
[assembly: SecurityTransparent]
[assembly: SecurityRules(SecurityRuleSet.Level2,SkipVerificationInFullTrust=true)]

And to always use a built-in delegate type or one from an assembly with those flags.

The reason being that anonymous dynamic code is hosted in an assembly that is always marked as partial trust. By allowing partially trusted callers you can skip part of the handshake. The transparency means that your code is not going to raise the security level (i.e. slow behavior), And finally the real trick is to invoke a delegate type hosted in an assembly that is marked as skip verification. Func<int,int>#Invoke is fully trusted, so no verification is needed. This will give you performance of code generated from the VS compiler. By not using these attributes you are looking at an overhead in .NET 4. You might think that SecurityRuleSet.Level1 would be a good way to avoid this overhead, but switching security models is also expensive.

In short, add those attributes, and then your micro-loop performance test, will run about the same.

回复收藏 0 原文

动次打次papapa 2024-10-24 23:58:26

听起来您遇到了调用开销。不过，无论来源如何，如果您的方法在从已编译的程序集加载时运行得更快，只需将其编译为程序集并加载即可！请参阅我的答案为什么是函数<>从 Expression> 创建比 Func<> 慢直接声明？了解有关如何声明的更多详细信息。

回复收藏 0 原文

飘逸的'云 2024-10-24 23:58:26

您可以通过Reflection.Emit手动编译表达式树。它通常会提供更快的编译时间（在我的例子中快了约 30 倍），并且允许您调整发出的结果性能。这并不难做到，特别是如果您的表达式是有限的已知子集。

这个想法是使用ExpressionVisitor来遍历表达式并发出相应表达式类型的IL。编写自己的访问者来处理已知的表达式子集，以及对于尚不支持的表达式类型回退到正常Expression.Compile也“相当”简单。

在我的例子中，我生成委托：

Func<object[], object> createA = state =>
    new A(
        new B(), 
        (string)state[11], 
        new ID[2] { new D1(), new D2() }) { 
        Prop = new P(new B()), Bop = new B() 
    };

测试创建相应的表达式树，并将其 Expression.Compile 与访问和发出 IL，然后从 DynamicMethod 创建委托进行比较。

结果：

编译表达式 3000 次：814
调用编译表达式 5000000 次：724
从表达式发出 3000 次：36
运行发出的表达式 5000000 次：722

手动编译时为 36 vs 814。

这里是完整代码。

You are may compile Expression Tree manually via Reflection.Emit. It will generally provide faster compilation time (in my case below ~30 times faster), and will allow you to tune emitted result performance. And it not so hard to do, especially if your Expressions are limited known subset.

The idea is to use ExpressionVisitor to traverse the expression and emit the IL for corresponding expression type. It's also "quite" simple to write your own Visitor to handle the known subset of expressions, and fallback to normal Expression.Compile for not yet supported expression types.

In my case I am generating the delegate:

Func<object[], object> createA = state =>
    new A(
        new B(), 
        (string)state[11], 
        new ID[2] { new D1(), new D2() }) { 
        Prop = new P(new B()), Bop = new B() 
    };

The test creates the corresponding expression tree and compares its Expression.Compile vs visiting and emitting the IL and then creating delegate from DynamicMethod.

The results:

Compile Expression 3000 times: 814
Invoke Compiled Expression 5000000 times: 724
Emit from Expression 3000 times: 36
Run Emitted Expression 5000000 times: 722

36 vs 814 when compiling manually.

Here the full code.

回复收藏 0 原文