Help with assembly/SSE multiplication

Posted on 2024-09-04 07:57:13

I've been trying to figure out how to speed up a crucial couple of lines in my code:

float x = a*b;
float y = c*d;
float z = e*f;
float w = g*h;

All of a, b, c, ... are floats.

I decided to look into using SSE, but I can't find any improvement. In fact, the SSE version turns out to be twice as slow. My SSE code is:

Vector4 abcd, efgh, result;
abcd = [float a, float b, float c, float d];
efgh = [float e, float f, float g, float h];
_asm {
movups xmm1, abcd
movups xmm2, efgh
mulps xmm1, xmm2
movups result, xmm1
}

I also tried standard inline assembly, but it doesn't appear that I can pack a register with four floats the way I can with SSE.

Any comments or help would be greatly appreciated. Mainly, I need to understand why my SSE calculations are slower than the serial C++ code.

I'm compiling in Visual Studio 2005 on Windows XP, using a Pentium 4 with HT, in case that additional information helps.

Thanks in advance!


Comments (5)

溇涏 2024-09-11 07:57:13

As you've found out, just replacing a couple of instructions with SSE is not going to work, because you need to shuffle the data around in memory in order to load the SSE registers correctly. That data movement (the part that constructs the arrays) is going to kill your performance, as memory is very slow (hard disk aside, memory is invariably the bottleneck these days).

Also, there is no way to move data between the SSE registers and the FPU/ALU without a write to RAM followed by a read. Modern IA-32 chips cope well with this particular write-then-read pattern, but it will still invalidate some cache, which has a knock-on effect.

To get the best out of SSE you need to look at the whole algorithm and the data it uses. The values of a, b, c and d, and of e, f, g and h, need to live permanently in those arrays so that no data has to be moved around in memory before loading the SSE registers. This is not straightforward and may require a lot of reworking of your code and data (you may even need to store the data differently on disk).
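
As a rough illustration of that layout, here is a minimal sketch using SSE intrinsics. It assumes the left-hand operands (a, c, e, g, ...) and the right-hand operands (b, d, f, h, ...) are already stored contiguously in two separate arrays; the function name `multiply_arrays` and the multiple-of-4 length requirement are assumptions made for the example, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_mul_ps, ...)

// Hypothetical helper: out[i] = lhs[i] * rhs[i] for n floats.
// Assumes n is a multiple of 4. Unaligned loads/stores are used so the
// sketch is safe for any pointer; aligned data could use _mm_load_ps.
void multiply_arrays(const float* lhs, const float* rhs, float* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 l = _mm_loadu_ps(lhs + i);          // four left-hand operands
        __m128 r = _mm_loadu_ps(rhs + i);          // four right-hand operands
        _mm_storeu_ps(out + i, _mm_mul_ps(l, r));  // four products at once
    }
}
```

Because the operands never leave their arrays, no per-call packing step is needed before the multiply.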

It might also be worth pointing out that SSE arithmetic is only 32-bit (or 64-bit if you use doubles), whereas the FPU is 80-bit (regardless of float or double), so you will get slightly different results with SSE than with the FPU. Only you know whether this will be an issue.

蒲公英的约定 2024-09-11 07:57:13

You are using unaligned instructions, which are very slow.
You may want to try aligning your data correctly, on a 16-byte boundary, and using movaps.
A better alternative is to use intrinsics rather than assembly, because then the compiler is free to order the instructions as it sees fit.
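
A minimal sketch of both suggestions together, assuming a C++11 compiler: `alignas(16)` provides the 16-byte alignment (Visual Studio 2005 predates it and would need `__declspec(align(16))` instead), and `_mm_load_ps`/`_mm_store_ps` are the aligned intrinsics that correspond to movaps. The `Vector4` layout and the `multiply` function are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical type for illustration: four floats on a 16-byte boundary.
struct alignas(16) Vector4 {
    float v[4];
};

// Lane-wise product of two aligned vectors.
Vector4 multiply(const Vector4& lhs, const Vector4& rhs) {
    // The aligned load/store intrinsics require 16-byte aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(lhs.v) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(rhs.v) % 16 == 0);
    Vector4 result;
    _mm_store_ps(result.v, _mm_mul_ps(_mm_load_ps(lhs.v), _mm_load_ps(rhs.v)));
    return result;
}
```

With intrinsics the compiler chooses the registers and can schedule the loads, multiply and store around neighbouring code, which it cannot do with an `_asm` block.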

演多会厌 2024-09-11 07:57:13

You can enable the use of SSE and SSE2 in the project options in newer VS versions, and possibly in 2005. Are you compiling with an Express edition?

Also, your SSE code is probably slower because, when you compile serial C++, the compiler is smart and does a very good job of making it fast, for example by automatically putting values in the right registers at the right time. If the operations occur serially, the compiler can also reduce the impact of caching and paging. Inline assembler, however, is optimized poorly at best and should be avoided whenever possible.

In addition, you'd have to be performing a huge amount of work for SSE/SSE2 to bring a notable benefit.

征﹌骨岁月お 2024-09-11 07:57:13

This is an old thread, but I noticed a mistake in your example. If you want to perform this:

float x = a*b;
float y = c*d;
float z = e*f;
float w = g*h;

Then the code should look like this:

Vector4 aceg, bdfh, result;  // result holds x, y, z, w
aceg = [float a, float c, float e, float g];  // left-hand operands
bdfh = [float b, float d, float f, float h];  // right-hand operands
_asm {
movups xmm1, aceg
movups xmm2, bdfh
mulps xmm1, xmm2
movups result, xmm1
}

To gain a little more speed, I'd suggest that you don't use a separate variable for "result".
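
For comparison, here is a minimal sketch of the same corrected pairing written with intrinsics instead of inline assembly. The helper name `multiply_pairs` and its signature are assumptions for the example. `_mm_setr_ps` places its first argument in lane 0, so the lanes end up in the order a, c, e, g and b, d, f, h.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: writes x = a*b, y = c*d, z = e*f, w = g*h to out[0..3].
void multiply_pairs(float a, float b, float c, float d,
                    float e, float f, float g, float h, float out[4]) {
    assert(out != nullptr);
    __m128 aceg = _mm_setr_ps(a, c, e, g);  // left-hand operands, lane 0 first
    __m128 bdfh = _mm_setr_ps(b, d, f, h);  // right-hand operands, lane 0 first
    _mm_storeu_ps(out, _mm_mul_ps(aceg, bdfh));
}
```

Note that building the vectors from eight separate scalars is exactly the packing cost discussed in this thread; the sketch only shows the lane order, not a guaranteed speedup.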

For starters, not all algorithms will benefit from being rewritten in SSE. Data-driven algorithms (such as those driven by lookup tables) don't translate well to SSE, because a lot of time is lost packing data into vectors and unpacking it again.

Hope this still helps.

深海夜未眠 2024-09-11 07:57:13

Firstly, when you have something that is 128-bit (16-byte) aligned you should use MOVAPS, as it can be much faster.
The compiler should usually give you 16-byte alignment, even on 32-bit systems.

Your C/C++ lines don't do the same thing as your SSE code.

The four floats in one xmm register are multiplied lane by lane by the four floats in the other register,
giving you:

float x = a*e;
float y = b*f;
float z = c*g;
float w = d*h;

In SSE1 you have to use SHUFPS to reorder the floats in both registers before multiplying.
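
A minimal sketch of that reordering with the `_mm_shuffle_ps` intrinsic (which maps to SHUFPS), assuming the data is stored in the question's original order, a b c d and e f g h. The function name `multiply_shuffled` is hypothetical. `_mm_shuffle_ps` takes its two low result lanes from the first operand and its two high lanes from the second, so two shuffles produce the a, c, e, g and b, d, f, h vectors.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: abcd and efgh each point to 4 floats in the
// question's order; xyzw receives a*b, c*d, e*f, g*h.
void multiply_shuffled(const float* abcd, const float* efgh, float* xyzw) {
    assert(abcd && efgh && xyzw);
    __m128 lo = _mm_loadu_ps(abcd);  // lanes 0..3: a b c d
    __m128 hi = _mm_loadu_ps(efgh);  // lanes 0..3: e f g h
    // Even lanes of each input: a c | e g
    __m128 aceg = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
    // Odd lanes of each input: b d | f h
    __m128 bdfh = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
    _mm_storeu_ps(xyzw, _mm_mul_ps(aceg, bdfh));
}
```

The two extra shuffles are cheap next to the loads, but they are still overhead that a data layout with the operands already separated would avoid.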

Also, when processing data that is bigger than the CPU cache, you can use non-temporal stores (MOVNTPS) to reduce cache pollution.
Note that non-temporal stores are a lot slower in other cases.
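
A minimal sketch of a non-temporal store loop, using the `_mm_stream_ps` intrinsic (which maps to MOVNTPS). The function name `multiply_streaming` is hypothetical, and the sketch assumes the output pointer is 16-byte aligned and the length is a multiple of 4. Whether this helps depends on the data really being too large to stay in cache, so it should be measured rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: out[i] = lhs[i] * rhs[i], written with streaming stores.
// Assumes n is a multiple of 4 and out is 16-byte aligned (required by MOVNTPS).
void multiply_streaming(const float* lhs, const float* rhs, float* out, std::size_t n) {
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 product = _mm_mul_ps(_mm_loadu_ps(lhs + i), _mm_loadu_ps(rhs + i));
        _mm_stream_ps(out + i, product);  // store that bypasses the cache
    }
    _mm_sfence();  // make the streamed stores visible before later stores
}
```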
