How do compilers handle SSE (or any) intrinsic functions?
A while ago I read somewhere that SSE intrinsic functions compile into efficient machine code because compilers treat them differently from ordinary functions. I am wondering how compilers actually do this, and what C programmers can do to facilitate the process. Are there any guidelines on how to use intrinsic functions in a way that makes the compiler's job of generating efficient machine code easier?
Thanks.
Comments (2)
Contrary to what Necrolis wrote, the intrinsics may or may not compile down to the instructions they represent. This is especially true for copy or load instructions such as _mm_load_pd, since the compiler is still responsible for register allocation and assignment when you use intrinsics. This means that copying a value from one location to another may not be necessary at all if the two locations can be represented by the same register. In that case the compiler may choose to remove the copy. It may also choose to remove other instructions if their result is never used. Check out this blog post, where the behavior of different compilers is compared in practice. It's from 2009, so the details may no longer apply. However, newer compilers are likely to optimize your code more, not less.
As for actually using intrinsics efficiently, the answer is the same as for all other performance optimization: measure, measure and measure. Make sure that you are actually dealing with a hot piece of code, find out why it is slow, and then improve it. You are very likely to find that improving your memory access patterns matters more than using intrinsics.
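To make the point about loads, copies and unused results concrete, here is a minimal sketch. It assumes an x86 target with SSE2 and 16-byte-aligned pointers; the function name add_then_scale is made up for illustration. The comments describe what an optimizing compiler is free to do, not what any particular compiler is guaranteed to emit.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_load_pd, ... */
#include <stdint.h>

/* Hypothetical helper: out[i] = (a[i] + b[i]) * a[i] for i = 0, 1.
   All three pointers are assumed to be 16-byte aligned, because
   _mm_load_pd and _mm_store_pd require aligned addresses. */
void add_then_scale(const double *a, const double *b, double *out)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    __m128d va = _mm_load_pd(a);          /* may become a separate load instruction */
    __m128d vb = _mm_load_pd(b);          /* may instead be folded into the add as a memory operand */
    __m128d copy = va;                    /* plain copy: the register allocator can drop it */
    __m128d sum = _mm_add_pd(copy, vb);
    __m128d unused = _mm_mul_pd(va, vb);  /* result never used: a candidate for removal */
    (void)unused;
    _mm_store_pd(out, _mm_mul_pd(sum, va));
}
```

Compiling this with something like gcc -O2 -S and reading the generated assembly is one way to see which of these intrinsics actually survive as instructions with your compiler.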
The intrinsics compile down to the instructions they represent; whether this is efficient or not depends on how they are used.
Also, each compiler treats intrinsics a little differently (that is, the handling is implementation-specific). GCC is open source, so you can see how it treats the SSE ones. Open Watcom*, LCC, PCC and TCC* are all open-source C compilers as well; although they don't have SSE intrinsics, they should still have intrinsics of some kind, and you can see how they handle them.
I think what you read was related to auto-vectorization of code, something GCC (see this) and ICC are very good at, although the result is not as good as hand-optimized code, at least not yet.
*These might have since been updated with support for SSE; I haven't checked lately.
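The difference between auto-vectorization and hand-written intrinsics can be sketched as follows. This is only an illustration, assuming an x86 target with SSE; the function names add_scalar and add_sse are hypothetical. The first function is plain C that a vectorizing compiler may turn into SIMD code on its own; the second spells out the same operation with intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

/* Plain C loop: a candidate for the compiler's auto-vectorizer.
   restrict promises that the arrays do not overlap, which makes
   vectorization easier for the compiler to prove safe. */
void add_scalar(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Hand-written SSE version: four floats per iteration, using unaligned
   loads and stores so no alignment is assumed, followed by a scalar
   loop for the remaining 0 to 3 elements. */
void add_sse(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

With GCC, auto-vectorization is enabled at -O3, and -fopt-info-vec reports which loops were vectorized. Comparing the assembly of the two functions, and timing them on your own data, is a reasonable way to check whether the hand-written version is worth keeping.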