如何使用 SSE 执行 8 x 8 矩阵运算?

发布于 2024-12-18 09:11:47 字数 591 浏览 1 评论 0原文

我最初的尝试看起来像这样(假设我们想要乘以)

  __m128 mat[n]; /* rows */
  __m128 vec[n] = {1,1,1,1};
  float outvector[n];
   for (int row=0;row<n;row++) {
       for(int k =3; k < 8; k = k+ 4)
       {
           __m128 mrow = mat[k];
           __m128 v = vec[row];
           __m128 sum = _mm_mul_ps(mrow,v);
           sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
       }
           _mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum));
 }

但这显然行不通。我该如何处理这个问题?

我应该一次加载 4 个......

另一个问题是:如果我的数组非常大(比如 n = 1000),我怎样才能使其 16 字节对齐?这可能吗?

My initial attempt looked like this (supposed we want to multiply)

  __m128 mat[n]; /* rows */
  __m128 vec[n] = {1,1,1,1};
  float outvector[n];
   for (int row=0;row<n;row++) {
       for(int k =3; k < 8; k = k+ 4)
       {
           __m128 mrow = mat[k];
           __m128 v = vec[row];
           __m128 sum = _mm_mul_ps(mrow,v);
           sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
       }
           _mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum));
 }

But this clearly doesn't work. How do I approach this?

I should load 4 at a time....

The other question is: if my array is very big (say n = 1000), how can I make it 16-bytes aligned? Is that even possible?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

小巷里的女流氓 2024-12-25 09:11:47

好的...我将使用行主矩阵约定。每行 [m] 需要 (2) 个 __m128 元素来产生 8 个浮点数。 8x1 向量 v 是一个列向量。由于您使用的是 haddps 指令,我假设 SSE3 可用。查找r = [m] * v

void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
    __m128 t0, t1, t2, t3, r0, r1, r2, r3;

    t0 = _mm_mul_ps(m[0][0], v[0]);
    t1 = _mm_mul_ps(m[1][0], v[0]);
    t2 = _mm_mul_ps(m[2][0], v[0]);
    t3 = _mm_mul_ps(m[3][0], v[0]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r0 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[0][1], v[1]);
    t1 = _mm_mul_ps(m[1][1], v[1]);
    t2 = _mm_mul_ps(m[2][1], v[1]);
    t3 = _mm_mul_ps(m[3][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r1 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[4][0], v[0]);
    t1 = _mm_mul_ps(m[5][0], v[0]);
    t2 = _mm_mul_ps(m[6][0], v[0]);
    t3 = _mm_mul_ps(m[7][0], v[0]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r2 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[4][1], v[1]);
    t1 = _mm_mul_ps(m[5][1], v[1]);
    t2 = _mm_mul_ps(m[6][1], v[1]);
    t3 = _mm_mul_ps(m[7][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r3 = _mm_hadd_ps(t0, t2);

    r[0] = _mm_add_ps(r0, r1);
    r[1] = _mm_add_ps(r2, r3);
}

至于对齐,__m128类型的变量应该在堆栈上自动对齐。对于动态内存,这不是一个安全的假设。某些 malloc/new 实现可能只返回保证 8 字节对齐的内存。

内在函数标头提供 _mm_malloc 和 _mm_free。在这种情况下,对齐参数应为 (16)。

OK... I'll use a row-major matrix convention. Each row of [m] requires (2) __m128 elements to yield 8 floats. The 8x1 vector v is a column vector. Since you're using the haddps instruction, I'll assume SSE3 is available. Finding r = [m] * v :

void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
    __m128 t0, t1, t2, t3, r0, r1, r2, r3;

    t0 = _mm_mul_ps(m[0][0], v[0]);
    t1 = _mm_mul_ps(m[1][0], v[0]);
    t2 = _mm_mul_ps(m[2][0], v[0]);
    t3 = _mm_mul_ps(m[3][0], v[0]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r0 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[0][1], v[1]);
    t1 = _mm_mul_ps(m[1][1], v[1]);
    t2 = _mm_mul_ps(m[2][1], v[1]);
    t3 = _mm_mul_ps(m[3][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r1 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[4][0], v[0]);
    t1 = _mm_mul_ps(m[5][0], v[0]);
    t2 = _mm_mul_ps(m[6][0], v[0]);
    t3 = _mm_mul_ps(m[7][0], v[0]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r2 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[4][1], v[1]);
    t1 = _mm_mul_ps(m[5][1], v[1]);
    t2 = _mm_mul_ps(m[6][1], v[1]);
    t3 = _mm_mul_ps(m[7][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r3 = _mm_hadd_ps(t0, t2);

    r[0] = _mm_add_ps(r0, r1);
    r[1] = _mm_add_ps(r2, r3);
}

As for alignment, a variable of a type __m128 should be automatically aligned on the stack. With dynamic memory, this is not a safe assumption. Some malloc / new implementations may only return memory guaranteed to be 8-byte aligned.

The intrinsics header provides _mm_malloc and _mm_free. The align parameter should be (16) in this case.

染墨丶若流云 2024-12-25 09:11:47

Intel 开发了一个小矩阵库,适用于尺寸范围从 1×1 到 6 的矩阵×6。应用说明AP-930 流SIMD 扩展 - 矩阵乘法 详细介绍了两个 6×6 矩阵相乘的算法。这应该可以通过一些努力适应其他尺寸的矩阵。

Intel has developed a Small Matrix Library for matrices with sizes ranging from 1×1 to 6×6. Application Note AP-930 Streaming SIMD Extensions - Matrix Multiplication describes in detail the algorithm for multiplying two 6×6 matrices. This should be adaptable to other size matrices with some effort.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文