How do I perform 8 x 8 matrix operations with SSE?
My initial attempt looked like this (suppose we want to multiply an 8x8 matrix by a vector):
__m128 mat[n];             /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];

for (int row = 0; row < n; row++) {
    for (int k = 3; k < 8; k = k + 4)
    {
        __m128 mrow = mat[k];
        __m128 v    = vec[row];
        __m128 sum  = _mm_mul_ps(mrow, v);
        sum = _mm_hadd_ps(sum, sum); /* adds adjacent-two floats */
    }
    _mm_store_ss(&outvector[row], _mm_hadd_ps(sum, sum));
}
But this clearly doesn't work. How do I approach this?
I should load 4 at a time....
The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?
2 Answers
OK... I'll use a row-major matrix convention. Each row of [m] requires two __m128 elements to hold its 8 floats. The 8x1 vector v is a column vector. Since you're using the haddps instruction, I'll assume SSE3 is available. Finding r = [m] * v:
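A minimal sketch of one way to do that multiply in C with SSE3 intrinsics; the function name matvec8 and the layout (mat[2*row] holding columns 0-3, mat[2*row+1] holding columns 4-7, vec[0]/vec[1] holding v) are illustrative assumptions, not the answer's original code:

#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

/* 8x8 row-major matrix: row r is stored in mat[2*r] (cols 0-3)
   and mat[2*r+1] (cols 4-7); vec[0]/vec[1] hold the 8x1 vector v. */
void matvec8(const __m128 mat[16], const __m128 vec[2], float out[8])
{
    for (int row = 0; row < 8; row++) {
        __m128 lo  = _mm_mul_ps(mat[2*row],     vec[0]); /* products, cols 0-3 */
        __m128 hi  = _mm_mul_ps(mat[2*row + 1], vec[1]); /* products, cols 4-7 */
        __m128 sum = _mm_add_ps(lo, hi);                 /* 4 partial sums     */
        sum = _mm_hadd_ps(sum, sum);                     /* 2 partial sums     */
        sum = _mm_hadd_ps(sum, sum);                     /* full dot product   */
        _mm_store_ss(&out[row], sum);                    /* write r[row]       */
    }
}

Each row needs one mul per half, one add to combine the halves, and two haddps to collapse the four partial sums into a single dot product.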
As for alignment, a variable of type __m128 should be automatically aligned on the stack. With dynamic memory this is not a safe assumption: some malloc/new implementations may only return memory guaranteed to be 8-byte aligned. The intrinsics header provides _mm_malloc and _mm_free; the align parameter should be 16 in this case.
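For example, a sketch of allocating a large 16-byte-aligned float array with _mm_malloc (the variable names and the n*n size are illustrative):

#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

int n = 1000;
/* 16-byte-aligned buffer for an n x n row-major matrix */
float *big = (float *)_mm_malloc((size_t)n * n * sizeof(float), 16);
if (big) {
    /* ... fill and use the matrix; aligned loads such as
       _mm_load_ps(&big[row*n + col]) are safe when the index
       is a multiple of 4 ... */
    _mm_free(big);
}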
Intel has developed a Small Matrix Library for matrices with sizes ranging from 1×1 to 6×6. Application Note AP-930, "Streaming SIMD Extensions - Matrix Multiplication", describes in detail the algorithm for multiplying two 6×6 matrices. This should be adaptable to other matrix sizes with some effort.