How much speed-up can I get by converting 3D maths to SSE or other SIMD?
I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?
In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.
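As a minimal sketch of the kind of batch routine where that 3x pays off (the function name is made up, and the tail and alignment handling a real library would need are omitted; unaligned loads are used here for simplicity, though aligned loads are faster when the data allows):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scale an array of floats by a constant, four lanes at a time.
   Assumes n is a multiple of 4; a real routine would handle the tail. */
void scale_floats(float *dst, const float *src, float k, int n) {
    __m128 vk = _mm_set1_ps(k);              /* broadcast k to all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);    /* load 4 floats */
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, vk));
    }
}
```

The win comes from amortizing loads and stores over many elements; calling this for a single vector would throw the advantage away.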
That's not the whole story, but further optimizations are possible with SIMD. Have a look at Miguel's PDC 2008 presentation on implementing SIMD instructions in Mono:
[Benchmark chart from Miguel's blog entry (source: tirania.org)]
Most likely you will see only a very small speed-up, if any, and the process will be more complicated than expected. For more details see the article "The Ubiquitous SSE vector class" by Fabian Giesen.
These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the native x87 ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. Nowadays you have to use a switch to turn off SSE generation and fall back to x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside of this is that we may lose the ability to do 80-bit DP float in registers. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for precision, you should look for an algorithm that is more tolerant of precision loss.

Everything above came as a complete surprise to me. It's very counter-intuitive. But data talks.
For 3D operations, beware of uninitialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) took 10x the normal time because of bad data in W.
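One defensive habit is to pin W to a known value whenever a 3D point is loaded into an SSE register, as in this sketch (the helper name is made up):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Build a 3D point in an SSE register with W set explicitly.
   _mm_set_ps takes its arguments highest lane first: (w, z, y, x).
   Leaving W as whatever garbage was in memory risks a NaN or denormal
   in that lane, which can make ops like _mm_add_ps run an order of
   magnitude slower on some CPUs. */
static __m128 make_point(float x, float y, float z) {
    return _mm_set_ps(0.0f, z, y, x);  /* W pinned to 0.0f */
}
```

The same caution applies when loading four floats straight from memory: make sure the fourth one is real data, not padding.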
The answer highly depends on what the library is doing and how it is used.
The gains can range from a few percentage points to "several times faster". The areas most likely to see gains are those where you're not dealing with isolated vectors or values, but with many vectors or values that all have to be processed in the same way.
Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.
The domains where gains can be the most drastic are probably image and signal processing, computational simulations, and general 3D maths operations on meshes (rather than on isolated vectors).
For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.
However, I should add that your algorithms should be designed for data-parallel execution.
This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.
For example, it should expose an interface that operates on whole arrays of vectors in one call, rather than on a single vec3 at a time.
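A minimal sketch of the contrast, with hypothetical names and a structure-of-arrays layout (each component in its own array, n assumed to be a multiple of 4):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* One-at-a-time interface: the loads and stores around a single add
   dominate, so SIMD buys little here. */
typedef struct { float x, y, z; } vec3;

vec3 vec3_add(vec3 a, vec3 b) {
    vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
}

/* Packed interface: adds four vectors per loop iteration, one SSE
   add per component stream. */
void vec3_add_batch(float *rx, float *ry, float *rz,
                    const float *ax, const float *ay, const float *az,
                    const float *bx, const float *by, const float *bz,
                    int n) {
    for (int i = 0; i < n; i += 4) {
        _mm_storeu_ps(rx + i, _mm_add_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i)));
        _mm_storeu_ps(ry + i, _mm_add_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i)));
        _mm_storeu_ps(rz + i, _mm_add_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i)));
    }
}
```

The structure-of-arrays layout is what lets each SSE instruction do useful work on all four lanes; an array of vec3 structs would force shuffles to get components into position.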
Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.