How much speed-up from converting 3D maths to SSE or other SIMD?

Posted 2024-07-05 10:27:00

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?

Comments (7)

流星番茄 2024-07-12 10:27:00

In my experience, I typically see about a 3x improvement when moving an algorithm from x87 to SSE, and better than a 5x improvement when moving to VMX/AltiVec (because of complicated issues having to do with pipeline depth, scheduling, etc.). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not where I'm handling one vector at a time ad hoc.
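
To make the "hundreds or thousands of numbers" case concrete, here is a minimal sketch of a bulk loop written with SSE intrinsics. The function name `saxpy_sse` and the choice of operation (y = a*x + y) are my own assumptions for illustration, not something from the answer, and no speed-up figure is implied.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical helper: y[i] = a * x[i] + y[i] for n floats.
// Processes four floats per iteration, then finishes the tail one by one.
void saxpy_sse(float a, const float* x, float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    const __m128 va = _mm_set1_ps(a);          // broadcast a into all 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);       // unaligned loads, so any pointer works
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
    for (; i < n; ++i) {                       // scalar tail for n not divisible by 4
        y[i] = a * x[i] + y[i];
    }
}
```

The same code applied to a single isolated vector would spend most of its time on loads and stores, which is the ad hoc case the answer says is not worth converting.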

甜尕妞 2024-07-12 10:27:00

That's not the whole story, but it's possible to get further optimizations using SIMD. Have a look at the presentation Miguel gave at PDC 2008 about implementing SIMD instructions in Mono.

SIMD beats doubles' ass in this particular configuration.
(source: tirania.org)

Picture from Miguel's blog entry.

春花秋月 2024-07-12 10:27:00

Most likely you will see only a very small speedup, if any, and the process will be more complicated than expected. For more details, see the article "The Ubiquitous SSE vector class" by Fabian Giesen.

The Ubiquitous SSE vector class: Debunking a common myth

Not that important

First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.

Not so hot

Not easy

Not now

Not ever

﹉夏雨初晴づ 2024-07-12 10:27:00

These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the x87 ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside is that we may lose the ability to do 80-bit extended-precision floating point in registers. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for precision, you should look for an algorithm that is more tolerant of precision loss.

Everything above came as a complete surprise to me. It's very counterintuitive. But data talks.
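
One rough way to see which mode a build is in is the standard `FLT_EVAL_METHOD` macro from `<cfloat>`. The sketch below is only an illustration: the helper names are hypothetical, and the mapping to SSE or x87 in the labels is the typical case rather than a guarantee. With GCC, the switch the answer alludes to is `-mfpmath=387` versus `-mfpmath=sse`.

```cpp
#include <cfloat>   // FLT_EVAL_METHOD
#include <string>

// Hypothetical helper: turn a FLT_EVAL_METHOD value into a readable label.
std::string describe_eval_method(int method) {
    switch (method) {
        case 0:  return "each type evaluated in its own precision (typical of SSE/SSE2 code)";
        case 1:  return "float and double evaluated as double";
        case 2:  return "everything evaluated as long double (typical of x87 code)";
        default: return "implementation-defined or indeterminate";
    }
}

// Reports the evaluation method of the current build.
std::string current_eval_method() {
    return describe_eval_method(FLT_EVAL_METHOD);
}
```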

幸福丶如此 2024-07-12 10:27:00

For 3D operations, beware of uninitialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x the normal time because of bad data in W.
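
A minimal sketch of one way to avoid this: always put a known value in the fourth lane when a 3-component vector is loaded into a 4-wide register. The `Vec3` type and both function names are hypothetical. The answer does not say what the "bad data" was; denormal values are one plausible cause, but that is my assumption.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical 3-component vector with no W member.
struct Vec3 { float x, y, z; };

// Build the register explicitly so W (lane 3) is always 0.0f,
// instead of loading 16 bytes and picking up whatever follows z in memory.
inline __m128 load_vec3_zero_w(const Vec3& v) {
    return _mm_set_ps(0.0f, v.z, v.y, v.x);  // arguments go from lane 3 down to lane 0
}

inline Vec3 add_vec3(const Vec3& a, const Vec3& b) {
    alignas(16) float out[4];
    _mm_store_ps(out, _mm_add_ps(load_vec3_zero_w(a), load_vec3_zero_w(b)));
    assert(out[3] == 0.0f);                  // W stayed well defined
    return Vec3{out[0], out[1], out[2]};
}
```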

〃温暖了心ぐ 2024-07-12 10:27:00

The answer highly depends on what the library is doing and how it is used.

The gains can range from a few percentage points to "several times faster". The areas most likely to see gains are those where you're not dealing with isolated vectors or values, but with multiple vectors or values that have to be processed in the same way.

Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.

The domains where gains can be the most drastic are probably image and signal processing, computational simulations, as well as general 3D maths operations on meshes (rather than isolated vectors).
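
As an illustration of "multiple vectors processed in the same way", here is a sketch that computes the squared length of many vertices at once. It assumes the mesh stores its coordinates as three separate arrays (structure-of-arrays), which is my assumption and not something the answer specifies; the function name is hypothetical too.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical helper: out[i] = xs[i]^2 + ys[i]^2 + zs[i]^2 for n vertices.
// For brevity this sketch requires n to be a multiple of 4.
void squared_lengths(const float* xs, const float* ys, const float* zs,
                     float* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(xs + i);   // x of vertices i..i+3
        __m128 y = _mm_loadu_ps(ys + i);
        __m128 z = _mm_loadu_ps(zs + i);
        __m128 sum = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                                _mm_mul_ps(z, z));
        _mm_storeu_ps(out + i, sum);       // four results per iteration
    }
}
```

Every lane does useful work here, and there is no shuffling between lanes, which is why this kind of batch operation tends to benefit more than a one-off dot product on a single vector.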

爱已欠费 2024-07-12 10:27:00

For some very rough numbers: I've heard some people on ompf.org claim 10x speed-ups for some hand-optimized ray tracing routines. I've also had some good speed-ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these still had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.

However, I should add that your algorithms should be designed for data-parallel execution.
This means that if you have a generic math library, as you've mentioned, then it should take packed vectors rather than individual vectors, or you'll just be wasting your time.

E.g. something like:

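#include <xmmintrin.h>  // SSE: __m128

namespace SIMD {
// Four 4D vectors stored component-wise: each __m128 holds one
// component (x, y, z or w) of four different vectors.
class PackedVec4d
{
  __m128 x;
  __m128 y;
  __m128 z;
  __m128 w;

  //...
};
}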
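
To show why the packed layout pays off, here is a minimal sketch of a dot product over four vector pairs at once. It uses a `struct` variant of `PackedVec4d` with public members, and the `dot` and `lane` helpers are hypothetical names of my own; the elided part of the class above may look quite different.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

namespace SIMD {
// Assumed public-member variant of the packed type, for illustration only.
struct PackedVec4d {
    __m128 x, y, z, w;
};

// Four dot products in one go: lane i of the result is a_i . b_i.
inline __m128 dot(const PackedVec4d& a, const PackedVec4d& b) {
    __m128 r = _mm_mul_ps(a.x, b.x);
    r = _mm_add_ps(r, _mm_mul_ps(a.y, b.y));
    r = _mm_add_ps(r, _mm_mul_ps(a.z, b.z));
    r = _mm_add_ps(r, _mm_mul_ps(a.w, b.w));
    return r;
}

// Read one lane back out (handy for debugging, slow in hot loops).
inline float lane(__m128 v, int i) {
    assert(i >= 0 && i < 4);
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);
    return tmp[i];
}
}  // namespace SIMD
```

Note that the dot product needs no horizontal adds or shuffles: each component lives in its own register, so the four multiplies and three adds produce four complete results.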

Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.
