Why does my data not seem to be aligned?

Posted on 2024-09-04 10:14:17

I'm trying to figure out how best to precalculate some sine and cosine values, store them in aligned blocks, and then use them later for SSE calculations:

At the beginning of my program, I create an object with a static member:

static __m128 *m_sincos;

then I initialize that member in the constructor:

m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
  m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));

When I go to use m_sincos, I run into three problems:
- The data does not seem to be aligned:

movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash

- The values read back do not seem to be correct:

movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values

- What really confuses me is that this makes everything work (but is too slow):

__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0

回忆那么伤 2024-09-11 10:14:17

m_sincos[t] is a C expression. Inside an inline assembly instruction (an __asm statement, presumably), however, it is interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:

movaps xmm0, m_sincos[t]

into: (see the disassembly window when the app crashes in debug mode)

movaps xmm0, xmmword ptr [t]

That instruction tries to copy the 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit variable whose address is probably not 16-byte aligned. Executing the instruction is therefore likely to raise an alignment fault, and in the odd case where t's address happens to be aligned, it silently loads the wrong data.
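To make the alignment point concrete, here is a minimal sketch of how one might check addresses before trusting movaps. The helper names is_aligned16 and table_entries_aligned are made up for illustration, and _mm_malloc/_mm_free stand in for _aligned_malloc/_aligned_free on the assumption that your toolchain's xmmintrin.h provides them.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: true when p sits on a 16-byte boundary, as movaps requires.
inline bool is_aligned16(const void *p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Allocates a table of `bins` __m128 entries (assumption: _mm_malloc is available)
// and reports whether every entry's address is 16-byte aligned.
inline bool table_entries_aligned(int bins) {
    assert(bins > 0);
    __m128 *table = static_cast<__m128 *>(_mm_malloc(bins * sizeof(__m128), 16));
    if (!table)
        return false;
    bool ok = true;
    for (int t = 0; t < bins; ++t)
        ok = ok && is_aligned16(&table[t]);  // base is aligned and entries are 16 bytes apart
    _mm_free(table);
    return ok;
}
```

Calling is_aligned16(&t) on an ordinary int local can come out either way, which is why the mis-assembled instruction sometimes faults and sometimes just loads garbage.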

You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:

__asm mov eax, m_sincos                  ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4                         ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
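For comparison, the same address arithmetic can be spelled out with intrinsics. This is only a sketch: load_entry is a hypothetical name, and in real code plain m_sincos[t] lets the compiler do this for you.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Mirrors the four assembly lines: base pointer + (t << 4), then an aligned load.
// `load_entry` is a hypothetical name used only for this sketch.
inline __m128 load_entry(const __m128 *table, int t) {
    const char *base = reinterpret_cast<const char *>(table);      // mov eax, m_sincos
    const char *addr = base + (static_cast<std::size_t>(t) << 4);  // shl ebx, 4 ; eax+ebx
    assert((reinterpret_cast<std::uintptr_t>(addr) & 15u) == 0);   // movaps needs 16-byte alignment
    return _mm_load_ps(reinterpret_cast<const float *>(addr));     // movaps xmm0, [eax+ebx]
}
```

The shift by 4 is just sizeof(__m128) == 16 written as a power of two; if the table base is 16-byte aligned, every entry address computed this way is too.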

Sidenote:

When I put this in a complete program, something odd occurs:

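#include <malloc.h>     // _aligned_malloc, _aligned_free
#include <math.h>
#include <xmmintrin.h>

int main()
{
    static __m128 *m_sincos;
    int Bins = 4;

    m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
    if (!m_sincos)
        return 1;

    for (int t=0; t<Bins; t++) {
        m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
        // Assembles as "movaps xmm0, xmmword ptr [t]": reads from &t, not from m_sincos[t]
        __asm movaps xmm0, m_sincos[t];
        // Correct version: address = m_sincos + t*16
        __asm mov eax, m_sincos
        __asm mov ebx, t
        __asm shl ebx, 4
        __asm movaps xmm0, [eax+ebx];
    }

    _aligned_free(m_sincos);
    return 0;
}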

When you run this and keep an eye on the registers window, you might notice something odd. The results are correct, but xmm0 already holds the correct value before the movaps instruction is executed. How does that happen?

A look at the generated assembly shows that the code for _mm_set_ps() builds the sin/cos values in xmm0 and then stores that register to the memory address of m_sincos[t]. The value is still sitting in xmm0 afterwards. _mm_set_ps is an intrinsic, not a function call; it does not restore the registers it used once it is done.

If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
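As a rough sketch of what "intrinsics throughout" could look like for the table in the question: build_sincos and scale_by_entry are hypothetical names, the multiply is only a placeholder for whatever the real SSE calculation is, and _mm_malloc/_mm_free are assumed to be available in place of _aligned_malloc/_aligned_free.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>

// Builds the table from the question with intrinsics only.
// The caller releases it with _mm_free (assumption: _mm_malloc/_mm_free are available).
inline __m128 *build_sincos(int bins) {
    assert(bins > 0);
    __m128 *table = static_cast<__m128 *>(_mm_malloc(bins * sizeof(__m128), 16));
    if (!table)
        return nullptr;
    for (int t = 0; t < bins; ++t) {
        const float c = std::cos(static_cast<float>(t));
        const float s = std::sin(static_cast<float>(t));
        table[t] = _mm_set_ps(c, s, s, c);  // lanes, low to high: c, s, s, c
    }
    return table;
}

// Placeholder computation: multiplies v lane by lane with table entry t.
// The compiler chooses the registers and the load instruction itself.
inline __m128 scale_by_entry(const __m128 *table, int t, __m128 v) {
    return _mm_mul_ps(table[t], v);
}
```

Because no __asm appears anywhere, the compiler knows what lives in each xmm register and can keep values there across statements instead of reloading them.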
