Exceptions, move semantics and optimization: at the compiler's mercy (MSVC2010)?
While upgrading my old exception class hierarchy to take advantage of some C++11 features, I ran some speed tests and came across somewhat frustrating results. Everything was built with the x64 MSVC++ 2010 compiler using maximum speed optimization (/O2).
Two very simple structs, both with bitwise copy semantics: one without a move assignment operator (why would you need one?), the other with one. Two simple inlined functions return newly created instances of these structs by value, and the results are assigned to local variables. Also note the try/catch block around everything. Here is the code:
#include <iostream>
#include <windows.h>

struct TFoo
{
    unsigned long long int m0;
    unsigned long long int m1;
    TFoo( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}
};

struct TBar
{
    unsigned long long int m0;
    unsigned long long int m1;
    TBar( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}
    TBar & operator=( TBar && f )
    {
        m0 = f.m0;
        m1 = f.m1;
        f.m0 = f.m1 = 0;
        return ( *this );
    }
};

TFoo MakeFoo( unsigned long long int f )
{
    return ( TFoo( f ) );
}

TBar MakeBar( unsigned long long int f )
{
    return ( TBar( f ) );
}

int main( void )
{
    try
    {
        unsigned long long int lMin = 0;
        unsigned long long int lMax = 20000000;
        LARGE_INTEGER lStart = { 0 };
        LARGE_INTEGER lEnd = { 0 };
        TFoo lFoo( 0 );
        TBar lBar( 0 );

        ::QueryPerformanceCounter( &lStart );
        for( auto i = lMin; i < lMax; i++ )
        {
            lFoo = MakeFoo( i );
        }
        ::QueryPerformanceCounter( &lEnd );
        std::cout << "lFoo = ( " << lFoo.m0 << " , " << lFoo.m1 << " )\t\tMakeFoo count : " << lEnd.QuadPart - lStart.QuadPart << std::endl;

        ::QueryPerformanceCounter( &lStart );
        for( auto i = lMin; i < lMax; i++ )
        {
            lBar = MakeBar( i );
        }
        ::QueryPerformanceCounter( &lEnd );
        std::cout << "lBar = ( " << lBar.m0 << " , " << lBar.m1 << " )\t\tMakeBar count : " << lEnd.QuadPart - lStart.QuadPart << std::endl;
    }
    catch( ... ){}
    return ( 0 );
}
Program output:
lFoo = ( 19999999 , 9999999 ) MakeFoo count : 428652
lBar = ( 19999999 , 9999999 ) MakeBar count : 74518
Assembly for both loops (including the surrounding counter calls):
//- MakeFoo loop START --------------------------------
00000001`3f4388aa 488d4810 lea rcx,[rax+10h]
00000001`3f4388ae ff1594db0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
00000001`3f4388b4 448bdf mov r11d,edi
00000001`3f4388b7 48897c2428 mov qword ptr [rsp+28h],rdi
00000001`3f4388bc 0f1f4000 nop dword ptr [rax]
00000001`3f4388c0 4981fb002d3101 cmp r11,1312D00h
00000001`3f4388c7 732a jae Prototype_Console!main+0x83 (00000001`3f4388f3)
00000001`3f4388c9 4c895c2450 mov qword ptr [rsp+50h],r11
00000001`3f4388ce 498bc3 mov rax,r11
00000001`3f4388d1 48d1e8 shr rax,1
00000001`3f4388d4 4889442458 mov qword ptr [rsp+58h],rax // these 3 lines
00000001`3f4388d9 0f28442450 movaps xmm0,xmmword ptr [rsp+50h] // are of interest
00000001`3f4388de 660f7f442430 movdqa xmmword ptr [rsp+30h],xmm0 // see MakeBar
00000001`3f4388e4 49ffc3 inc r11
00000001`3f4388e7 4c895c2428 mov qword ptr [rsp+28h],r11
00000001`3f4388ec 4c8b6c2438 mov r13,qword ptr [rsp+38h] // this one too
00000001`3f4388f1 ebcd jmp Prototype_Console!main+0x50 (00000001`3f4388c0)
00000001`3f4388f3 488d8c24c0000000 lea rcx,[rsp+0C0h]
00000001`3f4388fb ff1547db0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
//- MakeFoo loop END --------------------------------
//- MakeBar loop START --------------------------------
00000001`3f4389d1 488d8c24c8000000 lea rcx,[rsp+0C8h]
00000001`3f4389d9 ff1569da0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
00000001`3f4389df 4c8bdf mov r11,rdi
00000001`3f4389e2 48897c2440 mov qword ptr [rsp+40h],rdi
00000001`3f4389e7 4981fb002d3101 cmp r11,1312D00h
00000001`3f4389ee 7322 jae Prototype_Console!main+0x1a2 (00000001`3f438a12)
00000001`3f4389f0 4c895c2478 mov qword ptr [rsp+78h],r11
00000001`3f4389f5 498bf3 mov rsi,r11
00000001`3f4389f8 48d1ee shr rsi,1
00000001`3f4389fb 4d8be3 mov r12,r11 // these 3 lines
00000001`3f4389fe 4c895c2468 mov qword ptr [rsp+68h],r11 // are of interest
00000001`3f438a03 48897c2478 mov qword ptr [rsp+78h],rdi // see MakeFoo
00000001`3f438a08 49ffc3 inc r11
00000001`3f438a0b 4c895c2440 mov qword ptr [rsp+40h],r11
00000001`3f438a10 ebd5 jmp Prototype_Console!main+0x177 (00000001`3f4389e7)
00000001`3f438a12 488d8c24c0000000 lea rcx,[rsp+0C0h]
00000001`3f438a1a ff1528da0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
//- MakeBar loop END --------------------------------
If I remove the try/catch block, both timings are the same. With it present, however, the compiler clearly optimizes the code better for the struct with the redundant move operator=. Also, the MakeFoo time does depend on the size and layout of TFoo, but in general it is several times worse than the MakeBar time, which does not depend on small size changes.
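Before comparing the two structs, it may help to check which special member functions each one actually has. This is a minimal sketch assuming a conforming C++11 (or later) compiler; the two structs are copied unchanged from the question. Under the standard rules, declaring a move assignment operator in TBar suppresses its implicit move constructor and defines its implicit copy constructor and copy assignment operator as deleted, so TBar is no longer trivially copyable while TFoo is. The fact that the question's code (whose `return ( TBar( f ) );` needs a usable copy or move constructor before C++17) builds on MSVC 2010 suggests that compiler did not enforce this rule.

```cpp
#include <type_traits>

// Copied from the question: no user-declared copy/move members.
struct TFoo
{
    unsigned long long int m0;
    unsigned long long int m1;
    TFoo( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}
};

// Copied from the question: adds a user-provided move assignment operator.
struct TBar
{
    unsigned long long int m0;
    unsigned long long int m1;
    TBar( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}
    TBar & operator=( TBar && f )
    {
        m0 = f.m0;
        m1 = f.m1;
        f.m0 = f.m1 = 0;
        return ( *this );
    }
};

// TFoo keeps all of its implicit, trivial copy/move operations.
static_assert( std::is_trivially_copyable<TFoo>::value, "TFoo is trivially copyable" );
static_assert( std::is_trivially_copy_assignable<TFoo>::value, "TFoo copy-assigns bitwise" );

// TBar can be move-assigned, but not trivially (the operator is user-provided).
static_assert( std::is_move_assignable<TBar>::value, "TBar can be move-assigned" );
static_assert( !std::is_trivially_move_assignable<TBar>::value, "but not trivially" );

// The user-declared move assignment deletes the implicit copy operations
// and suppresses the implicit move constructor.
static_assert( !std::is_copy_assignable<TBar>::value, "copy assignment is deleted" );
static_assert( !std::is_copy_constructible<TBar>::value, "copy constructor is deleted" );
static_assert( !std::is_move_constructible<TBar>::value, "no usable move constructor" );
static_assert( !std::is_trivially_copyable<TBar>::value, "TBar is not trivially copyable" );
```

In other words, on a conforming compiler the "redundant" operator changes TBar's type properties, not just its assignment code.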
Questions:
Is this a compiler-specific feature of MSVC++ 2010 (could someone check with GCC?)?
Is it because the compiler has to preserve the temporary until the call finishes, so it cannot "rip it apart" in the case of MakeFoo, whereas in the case of MakeBar it knows we allow it to use move semantics, so it "rips it apart" and generates faster code?
Can I expect the same behavior for similar code without a try/catch block, but in more complicated scenarios?
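For anyone who wants to check the behavior with GCC or Clang (the first question), here is a portable sketch of the same two timed loops. It is an illustration under stated assumptions, not the asker's code: std::chrono::steady_clock stands in for QueryPerformanceCounter, RunBenchmark and ElapsedUs are hypothetical names, and TBar gains a defaulted move constructor. That constructor is needed because, on a conforming compiler, the user-declared move assignment operator deletes TBar's implicit copy constructor, so `return ( TBar( f ) );` would not compile as C++11 or C++14 without it.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

struct TFoo
{
    unsigned long long m0;
    unsigned long long m1;
    TFoo( unsigned long long f ) : m0( f ), m1( f / 2 ) {}
};

struct TBar
{
    unsigned long long m0;
    unsigned long long m1;
    TBar( unsigned long long f ) : m0( f ), m1( f / 2 ) {}
    // Assumption: added so that returning TBar by value compiles as C++11/14.
    TBar( TBar && ) = default;
    TBar & operator=( TBar && f )
    {
        m0 = f.m0;
        m1 = f.m1;
        f.m0 = f.m1 = 0;
        return *this;
    }
};

inline TFoo MakeFoo( unsigned long long f ) { return TFoo( f ); }
inline TBar MakeBar( unsigned long long f ) { return TBar( f ); }

// Hypothetical helper: elapsed microseconds between two time points.
static long long ElapsedUs( std::chrono::steady_clock::time_point start,
                            std::chrono::steady_clock::time_point end )
{
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::microseconds>( end - start ).count() );
}

// Hypothetical helper mirroring the question's main(): both loops inside a
// try block, each timed separately, final values printed so they stay observable.
// Returns true if both loops end with the same values.
bool RunBenchmark( unsigned long long lMax )
{
    try
    {
        TFoo lFoo( 0 );
        TBar lBar( 0 );

        auto lStart = std::chrono::steady_clock::now();
        for( unsigned long long i = 0; i < lMax; ++i )
        {
            lFoo = MakeFoo( i );
        }
        auto lEnd = std::chrono::steady_clock::now();
        std::printf( "lFoo = ( %llu , %llu )  MakeFoo: %lld us\n", lFoo.m0, lFoo.m1, ElapsedUs( lStart, lEnd ) );

        lStart = std::chrono::steady_clock::now();
        for( unsigned long long i = 0; i < lMax; ++i )
        {
            lBar = MakeBar( i );
        }
        lEnd = std::chrono::steady_clock::now();
        std::printf( "lBar = ( %llu , %llu )  MakeBar: %lld us\n", lBar.m0, lBar.m1, ElapsedUs( lStart, lEnd ) );

        // Sanity check: the last assignment came from i == lMax - 1.
        assert( lMax == 0 || lFoo.m0 == lMax - 1 );
        return lFoo.m0 == lBar.m0 && lFoo.m1 == lBar.m1;
    }
    catch( ... )
    {
        return false;
    }
}
```

For example, `int main() { return RunBenchmark( 20000000ULL ) ? 0 : 1; }` runs the question's 20,000,000 iterations; build it with something like `g++ -O2 -std=c++11` and compare the two printed times. As the answer points out, a single short run is noisy, so one result should not be over-interpreted.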
Comments (1):
Your test is flawed. When compiled with /O2 /EHsc, it runs to completion in a fraction of a second, and there is high variability in the results. I retried the same test but ran it for 100x as many iterations, with the following results (the results were similar across several runs of the test):
Your test does not show any difference between the performance of assignment of the two types.
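The answer's point is that a sub-second run is too noisy to compare. Besides increasing the iteration count, as the answer did, one common way to make such a comparison more robust is to repeat each measurement several times and compare medians. This is a minimal sketch assuming std::chrono is available; MedianTimeUs and any repetition count you pass to it are hypothetical illustrations, not something taken from the answer.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `body` `repeats` times and returns the median
// duration in microseconds (the upper median when `repeats` is even).
// Taking the median damps one-off outliers such as context switches.
template <typename Body>
long long MedianTimeUs( std::size_t repeats, Body body )
{
    // Precondition: repeats > 0 (checked in debug builds; release builds return 0).
    assert( repeats > 0 );
    std::vector<long long> samples;
    samples.reserve( repeats );
    for( std::size_t r = 0; r < repeats; ++r )
    {
        auto start = std::chrono::steady_clock::now();
        body();
        auto end = std::chrono::steady_clock::now();
        samples.push_back( static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>( end - start ).count() ) );
    }
    if( samples.empty() )
    {
        return 0;
    }
    // nth_element moves the median sample to the middle index.
    auto mid = samples.begin() + samples.size() / 2;
    std::nth_element( samples.begin(), mid, samples.end() );
    return *mid;
}
```

To use it, wrap each of the question's loops in a lambda and compare the two medians; keep the loop results observable afterwards (for example by printing them, as the question does) so the optimizer cannot remove the loop bodies.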