Need some constructive criticism on my SSE/Assembly attempt

Posted 2024-09-03 13:19:41

I'm working on converting a bit of code to SSE, and while I get the correct output, it turns out to be slower than the standard C++ code.

The bit of code that I need to do this for is:

float ox = p2x - (px * c - py * s)*m;
float oy = p2y - (px * s - py * c)*m;

What I've got for SSE code is:

void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
{
    vector4 r;
    __m128 scale = _mm_set1_ps(m);

    __asm
    {
        mov     eax,    p       //load the vector addresses into CPU regs
        mov     ebx,    sc
        movups  xmm0,   [eax]   //move the vectors into SSE regs
        movups  xmm1,   [ebx]

        mulps   xmm0,   xmm1    //multiply the elements

        movaps  xmm2,   xmm0    //make a copy of the result
        shufps  xmm2,   xmm0,   0x1B //shuffle (reverse) the element order

        subps   xmm0,   xmm2    //subtract the elements

        mulps   xmm0,   scale   //multiply the vector by the scale

        mov     ecx,    xy      //load the address of xy into a CPU reg
        movups  xmm3,   [ecx]   //move the vector into an SSE reg

        subps   xmm3,   xmm0    //subtract xmm3 - xmm0

        movups  [r],    xmm3    //save the return vector; use elements 0 and 3
    }
}

Since it's very difficult to read the code, I'll explain what I did:

loaded vector4,    xmm0  = [px, py, px, py]
mult. by vector4,  xmm1  = [c, c, s, s]
------------------------- multiply -------------------------
result,            xmm0  = [px*c, py*c, px*s, py*s]

reuse result,      xmm0  = [px*c, py*c, px*s, py*s]
shuffle result,    xmm2  = [py*s, px*s, py*c, px*c]
------------------------- subtract -------------------------
result,            xmm0  = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]

reuse result,      xmm0  = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
load m vector4,    scale = [m, m, m, m]
------------------------- multiply -------------------------
result,            xmm0  = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]

load xy vector4,   xmm3  = [p2x, p2x, p2y, p2y]
reuse,             xmm0  = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
------------------------- subtract -------------------------
result,            xmm3  = [p2x-(px*c-py*s)*m, p2x-(py*c-px*s)*m, p2y-(px*s-py*c)*m, p2y-(py*s-px*c)*m]

then ox = xmm3[0] and oy = xmm3[3], so I essentially don't use xmm3[1] or xmm3[2]

I apologize for how hard this is to read, but I'm hoping someone might be able to provide some guidance, as the standard C++ code runs in 0.001444 ms and the SSE code runs in 0.00198 ms.

Let me know if there is anything I can do to further explain/clean this up a bit. The reason I'm trying to use SSE is because I run this calculation millions of times, and it is a part of what is slowing down my current code.
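In case it makes the logic easier to follow, here is roughly the same sequence written with SSE intrinsics instead of inline assembly. This is just a sketch (it assumes vector4 is a plain struct of four packed floats, and it is not the code I actually timed):

#include <xmmintrin.h>   // SSE intrinsics

// Stand-in for the vector4 type above; assumed to be four packed floats.
struct vector4 { float v[4]; };

// p = [px, py, px, py], sc = [c, c, s, s], xy = [p2x, p2x, p2y, p2y]
void intrinsiccalc(const vector4 &p, const vector4 &sc, float m,
                   const vector4 &xy, float &ox, float &oy)
{
    __m128 scale = _mm_set1_ps(m);
    __m128 prod  = _mm_mul_ps(_mm_loadu_ps(p.v), _mm_loadu_ps(sc.v));  // [px*c, py*c, px*s, py*s]
    __m128 rev   = _mm_shuffle_ps(prod, prod, 0x1B);                   // reversed: [py*s, px*s, py*c, px*c]
    __m128 diff  = _mm_mul_ps(_mm_sub_ps(prod, rev), scale);           // [(px*c-py*s)*m, ...]
    __m128 res   = _mm_sub_ps(_mm_loadu_ps(xy.v), diff);               // xy - (...)*m
    float out[4];
    _mm_storeu_ps(out, res);
    ox = out[0];   // p2x - (px*c - py*s)*m
    oy = out[3];   // element 3, matching the register walkthrough above
}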

Thanks in advance for any help!
Brett

Comments (1)

迷爱 2024-09-10 13:19:41

The usual way to do this sort of vectorization is to turn the problem "on its side". Instead of computing a single value of ox and oy, you compute four ox values and four oy values simultaneously. This minimizes wasted computation and shuffles.

In order to do this, you bundle up several x, y, p2x and p2y values into contiguous arrays (e.g. you might have an array of four x values, an array of four y values, etc.). Then you can just do:

movups  %xmm0,  [x]    // load four x values
movups  %xmm1,  [y]    // load four y values
movaps  %xmm2,  %xmm0  // copy x
mulps   %xmm0,  [c]    // cx
movaps  %xmm3,  %xmm1  // copy y
mulps   %xmm1,  [s]    // sy
mulps   %xmm2,  [s]    // sx
mulps   %xmm3,  [c]    // cy
subps   %xmm0,  %xmm1  // cx - sy
subps   %xmm2,  %xmm3  // sx - cy
mulps   %xmm0,  scale  // (cx - sy)*m
mulps   %xmm2,  scale  // (sx - cy)*m
movaps  %xmm1,  [p2x]  // load four p2x values
movaps  %xmm3,  [p2y]  // load four p2y values
subps   %xmm1,  %xmm0  // p2x - (cx - sy)*m
subps   %xmm3,  %xmm2  // p2y - (sx - cy)*m
movups  [ox],   %xmm1  // store four ox results
movups  [oy],   %xmm3  // store four oy results

Using this approach, we compute 4 results simultaneously in 18 instructions, vs. a single result in 13 instructions with your approach. We're also not wasting any results.
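If inline assembly proves awkward, the same four-at-a-time loop can also be written with compiler intrinsics. The following is only a sketch: it assumes x, y, p2x, p2y, ox and oy are float arrays of length n, with n a multiple of four, and it keeps the unaligned loads and stores of the assembly above.

#include <xmmintrin.h>   // SSE intrinsics

// Sketch: process four points per iteration (structure-of-arrays layout).
// Assumes n is a multiple of four and every array holds at least n floats.
void rotate_batch(const float *x,   const float *y,
                  const float *p2x, const float *p2y,
                  float c, float s, float m,
                  float *ox, float *oy, int n)
{
    const __m128 vc = _mm_set1_ps(c);   // c, s and m stay in registers
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vm = _mm_set1_ps(m);
    for (int i = 0; i < n; i += 4)
    {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 cx = _mm_mul_ps(vc, vx);                  // c*x
        __m128 sy = _mm_mul_ps(vs, vy);                  // s*y
        __m128 sx = _mm_mul_ps(vs, vx);                  // s*x
        __m128 cy = _mm_mul_ps(vc, vy);                  // c*y
        __m128 tx = _mm_mul_ps(_mm_sub_ps(cx, sy), vm);  // (c*x - s*y)*m
        __m128 ty = _mm_mul_ps(_mm_sub_ps(sx, cy), vm);  // (s*x - c*y)*m
        _mm_storeu_ps(ox + i, _mm_sub_ps(_mm_loadu_ps(p2x + i), tx));
        _mm_storeu_ps(oy + i, _mm_sub_ps(_mm_loadu_ps(p2y + i), ty));
    }
}

Hoisting vc, vs and vm out of the loop is what lets the compiler keep them in registers across iterations.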

It could still be improved on; since you would have to rearrange your data structures anyway to use this approach, you should align the arrays and use aligned loads and stores instead of unaligned ones. You should load c and s into registers and use them to process many vectors of x and y, instead of reloading them for each vector. For the best performance, two or more vectors' worth of computation should be interleaved to make sure the processor has enough work to do and to prevent pipeline stalls.
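As a concrete illustration of the alignment point (a sketch only; _mm_malloc/_mm_free are one common way to get 16-byte-aligned buffers, and on MSVC they come from <malloc.h>), the _mm_loadu_ps/_mm_storeu_ps calls in the sketch above can then become _mm_load_ps/_mm_store_ps:

#include <malloc.h>      // _mm_malloc / _mm_free on MSVC
#include <xmmintrin.h>

int main()
{
    const int n = 1024;  // hypothetical batch size, a multiple of four
    // 16-byte-aligned buffers so the loop can use aligned (movaps-style) accesses.
    float *x  = static_cast<float *>(_mm_malloc(n * sizeof(float), 16));
    float *ox = static_cast<float *>(_mm_malloc(n * sizeof(float), 16));
    for (int i = 0; i < n; ++i) x[i] = static_cast<float>(i);

    const __m128 vc = _mm_set1_ps(0.707f);   // c loaded into a register once
    for (int i = 0; i < n; i += 4)
    {
        __m128 vx = _mm_load_ps(x + i);            // aligned load
        _mm_store_ps(ox + i, _mm_mul_ps(vc, vx));  // aligned store
    }

    _mm_free(x);
    _mm_free(ox);
    return 0;
}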

(On a side note: should it be cx + sy instead of cx - sy? That would give you a standard rotation matrix)
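For reference, the standard rotation by an angle θ (with c = cos θ, s = sin θ) is

    x' = c*x - s*y
    y' = s*x + c*y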

Edit

Your comment on what hardware you're doing your timings on pretty much clears everything up: "Pentium 4 HT, 2.79GHz". That's a very old microarchitecture, on which unaligned moves and shuffles are quite slow; you don't have enough work in the pipeline to hide the latency of the arithmetic operations, and the reorder engine isn't nearly as clever as it is on newer microarchitectures.

I expect that your vector code would prove to be faster than the scalar code on i7, and probably on Core2 as well. On the other hand, doing four at a time, if you could, would be much faster still.
