How is this virtual method call faster than the sealed method call?
I am doing some tinkering with the performance of virtual vs sealed members.
Below is my test code.
The output is
virtual total 3166ms
per call virtual 3.166ns
sealed total 3931ms
per call sealed 3.931ns
I must be doing something wrong because according to this the virtual call is faster than the sealed call.
I am running in Release mode with "Optimize code" turned on.
Edit: when running outside of VS (as a console app) the times are close to a dead heat, but the virtual almost always comes out in front.
using System;
using System.Diagnostics;
using NUnit.Framework;

[TestFixture]
public class VirtTests
{
    public class ClassWithNonEmptyMethods
    {
        private double x;
        private double y;

        public virtual void VirtualMethod()
        {
            x++;
        }

        public void SealedMethod()
        {
            y++;
        }
    }

    const int iterations = 1000000000;

    [Test]
    public void NonEmptyMethodTest()
    {
        var foo = new ClassWithNonEmptyMethods();

        // Pre-call so both methods are JIT-compiled before timing
        foo.VirtualMethod();
        foo.SealedMethod();

        var virtualWatch = new Stopwatch();
        virtualWatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            foo.VirtualMethod();
        }
        virtualWatch.Stop();
        Console.WriteLine("virtual total {0}ms", virtualWatch.ElapsedMilliseconds);
        Console.WriteLine("per call virtual {0}ns", ((float)virtualWatch.ElapsedMilliseconds * 1000000) / iterations);

        var sealedWatch = new Stopwatch();
        sealedWatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            foo.SealedMethod();
        }
        sealedWatch.Stop();
        Console.WriteLine("sealed total {0}ms", sealedWatch.ElapsedMilliseconds);
        Console.WriteLine("per call sealed {0}ns", ((float)sealedWatch.ElapsedMilliseconds * 1000000) / iterations);
    }
}
You are testing the effects of memory alignment on code efficiency. The 32-bit JIT compiler has trouble generating efficient code for value types that are more than 32 bits in size, long and double in C# code. The root of the problem is the 32-bit GC heap allocator, it only promises alignment of allocated memory on addresses that are a multiple of 4. That's an issue here, you are incrementing doubles. A double is efficient only when it is aligned on an address that's a multiple of 8. Same issue with the stack, in case of local variables, it is also aligned only to 4 on a 32-bit machine.
The L1 CPU cache is internally organized in blocks called a "cache line". There is a penalty when the program reads a mis-aligned double. Especially one that straddles the end of a cache line, bytes from two cache lines have to be read and glued together. Mis-alignment isn't uncommon in the 32-bit jitter, it is merely 50-50 odds that the 'x' field happens to be allocated on an address that's a multiple of 8. If it isn't then 'x' and 'y' are going to be misaligned and one of them may well straddle the cache line. The way you wrote the test, that's going to either make VirtualMethod or SealedMethod slower. Make sure you let them use the same field to get comparable results.
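One way to follow the same-field advice (my sketch, not code from the original answer) is to have both methods increment the one field, so any alignment penalty hits both timings equally:

```csharp
public class ClassWithNonEmptyMethods
{
    private double x;

    // Both methods touch the same field, so whether or not 'x'
    // happens to land on an 8-aligned address, the penalty (if any)
    // applies to the virtual and the non-virtual loop alike.
    public virtual void VirtualMethod() { x++; }
    public void SealedMethod() { x++; }
}
```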
The same is true for code. Swap the code for the virtual and sealed test to arbitrarily change the outcome. I had no trouble making the sealed test quite a bit faster that way. Given the modest difference in speed, you are probably looking at a code alignment issue. The x64 jitter makes an effort to insert NOPs to get a branch target aligned, the x86 jitter doesn't.
You should also run the timing test several times in a loop, at least 20. You are likely to then also observe the effect of the garbage collector moving the class object. The double may have a different alignment afterward, dramatically changing the timing. Accessing a 64-bit value type value like long or double has 3 distinct timings, aligned on 8, aligned on 4 within a cache line, and aligned on 4 across two cache lines. In fast to slow order.
The penalty is steep, reading a double that straddles a cache line is roughly three times slower than reading an aligned one. Also the core reason why a double[] (array of doubles) is allocated in the Large Object Heap even when it has only 1000 elements, well south of the normal threshold of 80KB, the LOH has an alignment guarantee of 8. These alignment problems entirely disappear in code generated by the x64 jitter, both the stack and the GC heap have an alignment of 8.
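The run-it-at-least-20-times advice could be sketched as below; TimeVirtual and TimeSealed are hypothetical helpers standing in for the two timed loops from the question:

```csharp
// Repeat the whole comparison; a garbage collection between runs can
// move the object and change the double's alignment, so per-run
// numbers may swing noticeably under the 32-bit JIT.
for (var run = 0; run < 20; run++)
{
    var foo = new ClassWithNonEmptyMethods();
    GC.Collect();   // encourage the heap to shift between runs
    Console.WriteLine("run {0}: virtual {1}ms, sealed {2}ms",
        run, TimeVirtual(foo), TimeSealed(foo));
}
```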
First, you have to mark the method sealed. Secondly, provide an override to the virtual method. Create an instance of the derived class. As a third test, create a sealed override method. Now you can start comparing.
Edit: You should probably run this outside VS.
Update: Example of what I mean.
Now test the call speed of Bar for an instance of Baz and Woz. I also suspect member and class visibility outside the assembly could affect JIT analysis.
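The example this answer refers to did not survive in this copy; a sketch consistent with the text (Bar, Baz, and Woz come from the answer, the base class name Foo is my assumption) might look like:

```csharp
public class Foo
{
    public virtual void Bar() { }          // virtual, dispatched through the vtable
}

public class Baz : Foo
{
    public override void Bar() { }         // plain override, still virtual
}

public class Woz : Foo
{
    public sealed override void Bar() { }  // sealed override, a candidate for devirtualization
}
```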
You might be seeing some start up cost. Try wrapping the Test-A/Test-B code in a loop and run it several times. You might also be seeing some kind of ordering effects. To avoid that (and top/bottom of loop effects), unroll it 2-3 times.
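A sketch of the wrap-and-unroll idea; TimeA and TimeB are hypothetical helpers standing in for the Test-A/Test-B sections:

```csharp
// Interleave and unroll the two measurements so neither one always
// pays the start-up cost or always sits in the same loop position.
for (var run = 0; run < 5; run++)
{
    TimeA(); TimeB();
    TimeB(); TimeA();   // reversed order to cancel ordering effects
    TimeA(); TimeB();
}
```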
Using the following code as a reference for our test, let's analyze the Microsoft intermediate language (MSIL) information generated by the compiler using the Ildasm.exe (IL Disassembler) tool.
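The code listing this answer analyzes was not preserved in this copy; a minimal sketch consistent with the description (a base class, a derived class, and a sealed class, each exercised through DoStuff(); the class names are assumptions) might be:

```csharp
public class BaseClass
{
    public virtual void DoStuff() { }
}

public class DerivedClass : BaseClass
{
    public override void DoStuff() { }
}

public sealed class SealedClass
{
    public void DoStuff() { }
}

public static class Program
{
    public static void Main()
    {
        // Each instance is created with newobj, and each DoStuff()
        // call is emitted as callvirt in the generated MSIL.
        new BaseClass().DoStuff();
        new DerivedClass().DoStuff();
        new SealedClass().DoStuff();
    }
}
```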
To run this tool, open the Developer Command Prompt for Visual Studio and execute the command ildasm.
Once the application is started, load the executable (or assembly) of the previous application.
Double click on the Main method to view the Microsoft intermediate language (MSIL) information.
As you can see, each class uses newobj to create a new instance by pushing an object reference onto the stack, and callvirt to make a late-bound call to the DoStuff() method of its respective object.
Judging from this information, it seems that sealed, derived, and base classes are all managed in the same way by the compiler. Just to be sure, let's dig deeper by analyzing the JIT-compiled code with the Disassembly window in Visual Studio.
Enable the Disassembly window by selecting Enable address-level debugging under Tools > Options > Debugging > General.
Set a breakpoint at the beginning of the application and start debugging. Once the application hits the breakpoint, open the Disassembly window by selecting Debug > Windows > Disassembly.
As we can see in the previous code, while the creation of the objects is the same, the instructions executed to invoke the methods of the sealed and derived/base classes are slightly different. After moving data into the CPU registers (mov instruction), the invocation of the sealed method executes a comparison between dword ptr [ecx] and ecx (cmp instruction) before actually calling the method.
According to the report Instruction latencies and throughput for AMD and Intel x86 processors, written by Torbjörn Granlund, the speeds of the mov and cmp instructions on an Intel Pentium 4 are:
[table: instructions per cycle for the mov and cmp instruction types; the figures were not preserved in this copy]
In conclusion, the optimizations in today's compilers and processors have made the performance difference between sealed and non-sealed classes so small that it is irrelevant to the majority of applications.
References
https://learn.microsoft.com/en-us/dotnet/api/system.reflection.emit.opcodes.newobj?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.reflection.emit.opcodes.callvirt?view=netframework-4.8
https://learn.microsoft.com/en-us/visualstudio/debugger/how-to-use-the-disassembly-window?view=vs-2019
https://www.aldeid.com/wiki/X86-assembly/Instructions
Instruction latencies and throughput for AMD and Intel x86 processors: https://gmplib.org/~tege/x86-timing.pdf