Nested for-loop debugging using a matrix algorithm and constants
This set of nested for loops works correctly for M=64 and N=64, but not when I set M=128 and N=64. I have another program that checks the matrix multiply for correct values. Intuitively it seems like it should still work, but it gives me the wrong answer.
for(int m=64;m<=M;m+=64){
    for(int n=64;n<=N;n+=64){
        for(int i = m-64; i < m; i+=16){
            float *A_column_start, *C_column_start;
            __m128 c_1, c_2, c_3, c_4, a_1, a_2, a_3, a_4, mul_1,
                   mul_2, mul_3, mul_4, b_1;
            int j, k;
            for(j = m-64; j < m; j++){
                //Load 16 contiguous column aligned elements from matrix C in
                //c_1-c_4 registers
                C_column_start = C+i+j*M;
                c_1 = _mm_loadu_ps(C_column_start);
                c_2 = _mm_loadu_ps(C_column_start+4);
                c_3 = _mm_loadu_ps(C_column_start+8);
                c_4 = _mm_loadu_ps(C_column_start+12);
                for (k=n-64; k < n; k+=2){
                    //Load 16 contiguous column aligned elements from matrix A to
                    //the a_1-a_4 registers
                    A_column_start = A+k*M;
                    a_1 = _mm_loadu_ps(A_column_start+i);
                    a_2 = _mm_loadu_ps(A_column_start+i+4);
                    a_3 = _mm_loadu_ps(A_column_start+i+8);
                    a_4 = _mm_loadu_ps(A_column_start+i+12);
                    //Load a value to register b_1 to act as a "B" (or "A^T")
                    //element to multiply against the A matrix
                    b_1 = _mm_load1_ps(A_column_start+j);
                    mul_1 = _mm_mul_ps(a_1, b_1);
                    mul_2 = _mm_mul_ps(a_2, b_1);
                    mul_3 = _mm_mul_ps(a_3, b_1);
                    mul_4 = _mm_mul_ps(a_4, b_1);
                    //Add together all values of the multiplied A and "B"
                    //(or "A^T") matrix elements
                    c_4 = _mm_add_ps(c_4, mul_4);
                    c_3 = _mm_add_ps(c_3, mul_3);
                    c_2 = _mm_add_ps(c_2, mul_2);
                    c_1 = _mm_add_ps(c_1, mul_1);
                    //Move over one column in A, and load the next 16 contiguous
                    //column aligned elements from matrix A to the a_1-a_4 registers
                    A_column_start+=M;
                    a_1 = _mm_loadu_ps(A_column_start+i);
                    a_2 = _mm_loadu_ps(A_column_start+i+4);
                    a_3 = _mm_loadu_ps(A_column_start+i+8);
                    a_4 = _mm_loadu_ps(A_column_start+i+12);
                    //Load a value to register b_1 to act as a "B" (or "A^T")
                    //element to multiply against the A matrix
                    b_1 = _mm_load1_ps(A_column_start+j);
                    mul_1 = _mm_mul_ps(a_1, b_1);
                    mul_2 = _mm_mul_ps(a_2, b_1);
                    mul_3 = _mm_mul_ps(a_3, b_1);
                    mul_4 = _mm_mul_ps(a_4, b_1);
                    //Add together all values of the multiplied A and "B"
                    //(or "A^T") matrix elements
                    c_4 = _mm_add_ps(c_4, mul_4);
                    c_3 = _mm_add_ps(c_3, mul_3);
                    c_2 = _mm_add_ps(c_2, mul_2);
                    c_1 = _mm_add_ps(c_1, mul_1);
                }
                //Store the added up C values back to memory
                _mm_storeu_ps(C_column_start, c_1);
                _mm_storeu_ps(C_column_start+4, c_2);
                _mm_storeu_ps(C_column_start+8, c_3);
                _mm_storeu_ps(C_column_start+12, c_4);
            }
        }
    }
}}
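The checking program is not shown in the post. A minimal scalar sketch of such a reference might look like the following. It assumes the loop nest is meant to compute C += A·Aᵀ with A stored column-major as M×N and C as M×M; that layout is an assumption read off the index expressions and comments, not something the post states. The name `reference_aat` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for C += A * A^T (assumed operation).
   Assumed layout: column-major, A is M x N with element (r, c) at
   A[r + c*M]; C is M x M with element (r, c) at C[r + c*M]. */
void reference_aat(int M, int N, const float *A, float *C)
{
    assert(M > 0 && N > 0);
    for (int j = 0; j < M; j++) {          /* column of C */
        for (int k = 0; k < N; k++) {      /* column of A */
            float b = A[j + (size_t)k * M];            /* A^T(k, j) = A(j, k) */
            for (int i = 0; i < M; i++)    /* row of C */
                C[i + (size_t)j * M] += A[i + (size_t)k * M] * b;
        }
    }
}
```

Running this and the SSE version on the same zero-initialised C and comparing the two outputs entry by entry would show which entries differ, not just that the result is wrong.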
Comments (2)
It works correctly for M = 64 and N = 64 because in these cases you are only doing one iteration of each of the two outermost loops. When you have M = 128, you now do two steps on the outer loop, in which case the line
and the line
will produce the same results for the inner loop, so essentially for the two steps you do on the outer loop (m = 64, 128) you are just doubling the result of one step with m = 128. The fix is as simple as changing M to m so that you use the iteration variable.
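One way to see what changes between M = 64 and M = 128 is to count which 64×64 tiles of C the loop bounds actually reach. The sketch below mirrors only the m, i and j bounds from the question and assumes C is M×M with every tile expected to receive updates, which the question does not state outright. `tiles_visited`, `tiles_total`, and `MAX_TILES` are hypothetical names.

```c
#include <assert.h>

#define TILE 64
#define MAX_TILES 16  /* hypothetical cap so the bookkeeping array is fixed-size */

/* Total number of TILE x TILE tiles in an M x M output. */
int tiles_total(int M)
{
    return (M / TILE) * (M / TILE);
}

/* Number of distinct tiles reached by the question's m, i and j loop
   bounds. The n and k loops do not change which C entries are written,
   so they are left out. */
int tiles_visited(int M)
{
    unsigned char seen[MAX_TILES][MAX_TILES] = {{0}};
    int count = 0;

    assert(M > 0 && M % TILE == 0 && M / TILE <= MAX_TILES);

    for (int m = TILE; m <= M; m += TILE)
        for (int i = m - TILE; i < m; i += 16)
            for (int j = m - TILE; j < m; j++)
                seen[i / TILE][j / TILE] = 1;

    for (int r = 0; r < M / TILE; r++)
        for (int c = 0; c < M / TILE; c++)
            count += seen[r][c];
    return count;
}
```

For M = 64 this reaches 1 of 1 tiles, while for M = 128 it reaches only 2 of 4, because i and j always share the same block index m. If the assumption above holds, the off-diagonal tiles are never updated, which might mean i and j need independent block loops. Under the same assumption, the M in C+i+j*M and A+k*M acts as the column stride, so any M-to-m change is worth checking against a scalar reference before relying on it.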
You should also consider aligning the data in A and C so that you can use aligned SSE loads. This can result in much faster code.
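As a rough illustration of the aligned-load suggestion, the sketch below allocates 16-byte-aligned buffers and uses `_mm_load_ps` / `_mm_store_ps` in place of the unaligned variants. It assumes an x86 target and a GCC- or Clang-style `<xmmintrin.h>` that also declares `_mm_malloc` and `_mm_free` (MSVC declares them in `<malloc.h>`). The helper names `alloc_aligned_floats` and `axpy4_aligned` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumed to also provide _mm_malloc */

/* Allocate n floats on a 16-byte boundary; release with _mm_free. */
float *alloc_aligned_floats(size_t n)
{
    float *p = (float *)_mm_malloc(n * sizeof(float), 16);
    assert(p != NULL && ((uintptr_t)p % 16) == 0);
    return p;
}

/* dst[0..3] += a[0..3] * b. Both pointers must be 16-byte aligned,
   otherwise _mm_load_ps / _mm_store_ps may fault. */
void axpy4_aligned(float *dst, const float *a, float b)
{
    __m128 va = _mm_load_ps(a);     /* aligned load of four floats */
    __m128 vb = _mm_set1_ps(b);     /* broadcast the scalar */
    __m128 vd = _mm_load_ps(dst);
    _mm_store_ps(dst, _mm_add_ps(vd, _mm_mul_ps(va, vb)));
}
```

In the question's loops, i is a multiple of 16 and M a multiple of 64, so addresses such as C+i+j*M (and the +4, +8, +12 offsets) stay 16-byte aligned whenever the base pointers A and C are.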
I guess your use of M in the code needs to be m instead, possibly also in other lines where you use M. However, I do not really understand your code, since you did not explain what it is intended to do, and I'm not a math programmer.