How good is the Visual C++ 2008/2010 compiler at optimizing?
I'm just wondering how well the MSVC++ compiler can optimize code (with code examples), or what it can't optimize and why.
For example, I used the SSE intrinsics with something like this (var is an __m128 value; it was for a frustum-culling test):
if( var.m128_f32[0] > 0.0f && var.m128_f32[1] > 0.0f && var.m128_f32[2] > 0.0f && var.m128_f32[3] > 0.0f ) {
...
}
When I looked at the asm output, I saw that it compiled to an ugly, very jumpy version (and I know that CPUs hate tight jumps). I also know that I could optimize it with the SSE4.1 PTEST instruction, so why didn't the compiler do that (the compiler writers defined the PTEST intrinsic, so they know the instruction)?
What other optimizations can't it do (so far)?
Does this imply that with today's technology I am forced to use intrinsics, inline ASM, and linked ASM functions? Will compilers ever find such things (I don't think so)?
Where can I read more about how well the MSVC++ compiler optimizes?
(Edit 1): I used the SSE2 switch (/arch:SSE2) and the /fp:fast switch.
Comments (5)
By default the compiler generates code that will run on a 'lowest common denominator' CPU, i.e. one without SSE 4.1 instructions.
You can change that by targeting only later CPUs in the build options.
That said, the MS compiler is traditionally 'not the best' when it comes to SSE optimisation. I'm not even sure if it supports SSE 4 at all. That link gives good credit to GCC for SSE optimisation:
Perhaps you need to change compilers!
You might want to try Intel's ICC compiler. In my experience it generates much better code than Visual C++, especially for SSE code. You can get a free 30-day evaluation license from intel.com.
You can enable the asm view of the compiled code and see for yourself what is generated.
Check the presentation at http://lambda-the-ultimate.org/node/3674
Summary: Compilers generally do lots of amazing tricks now, even things that don't seem to be related to imperative programming, like tail-call optimization. MSVC++ is not the best, but it still seems pretty good.
If-statements generate conditional jumps unless you can utilize conditional moves, but that is more likely something done in hand-written assembly. There are rules that govern the CPU's conditional jump assumptions (branch prediction), such that the penalty of a conditional jump which behaves according to the rules is acceptable. Then there is out-of-order execution to complicate things further :). The bottom line is that if your code is straightforward, the jumps which eventually occur won't mess up performance. You might check out Agner Fog's optimization pages.
A non-debug compilation of your C code specifically should generate four conditional jumps. The logical ANDs (&&) and the parentheses will result in left-to-right testing, so one C-level optimization could be to test the f32 that is most likely to be > 0.0f first (if such a probability can be determined). You have five possible execution variants: test1 true, branch taken (t1tbt); test1 false, no branch (t1fnb); test2 true, branch taken (t2tbt); etc., giving the following possible sequences
Only a taken branch will result in a pipeline disruption, and branch prediction will minimize the disruption as much as possible.
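If the four short-circuit tests are still a concern, one way to avoid them without SSE4.1 is to compare all four lanes at once and branch on a single mask. The following is only a sketch: all_lanes_positive is a hypothetical helper name, not something from the original post, and it uses plain SSE intrinsics from <xmmintrin.h> instead of the PTEST intrinsic mentioned in the question.

```cpp
#include <xmmintrin.h>  // SSE: _mm_cmpgt_ps, _mm_setzero_ps, _mm_movemask_ps

// Hypothetical helper (name is an assumption, not from the original post).
// Returns true when all four lanes of v are strictly greater than 0.0f.
static bool all_lanes_positive(__m128 v) {
    // Each lane becomes all-ones where v > 0.0f and all-zeros otherwise.
    __m128 gt = _mm_cmpgt_ps(v, _mm_setzero_ps());
    // Collect the top bit of each lane into the low four bits of an int.
    // All four bits set means every lane passed the comparison.
    return _mm_movemask_ps(gt) == 0xF;
}
```

The original condition would then read if( all_lanes_positive(var) ) { ... }, which leaves one compare, one mask extraction and one branch. Whether that is actually faster in a given culling loop is something to measure, not assume.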
Assuming floats are expensive to test (they are), if var is a union and you are well versed in the ins and outs of floating point, you might consider doing integer testing on the overlapping types. For example, the stored value 1.0f occupies four bytes stored as 0x00, 0x00, 0x80, 0x3f (x86/little-endian). Reading this value as a long integer gives 0x3f800000, or +1065353216. 0.0f is 0x00, 0x00, 0x00, 0x00, or 0x00000000 (long). Negative float values have exactly the same format as positive ones, except that the highest bit is set (0x80000000).
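The bit patterns described above can be checked with a small sketch. It assumes a 32-bit IEEE 754 float, and float_bits and is_positive_by_bits are hypothetical helper names. It copies the bytes with memcpy instead of reading through a union, since that is the portable way to reinterpret them in C++. Note one caveat: a NaN with a clear sign bit also has a positive integer pattern, so the integer test only matches f > 0.0f for non-NaN inputs.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the bytes of a float as a signed 32-bit integer.
static std::int32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "assumes a 32-bit float");
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // byte-wise copy, no aliasing issues
    return bits;
}

// Hypothetical helper: integer version of the test f > 0.0f.
// Negative values (including -0.0f) have the sign bit set, so they read as
// negative integers; +0.0f reads as 0; positive values read as positive integers.
// Assumption: f is not NaN (a NaN with a clear sign bit would also return true).
static bool is_positive_by_bits(float f) {
    return float_bits(f) > 0;
}
```

With these helpers, float_bits(1.0f) yields 0x3f800000 and float_bits(0.0f) yields 0, matching the values given above.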