在汇编中实现矩阵向量乘法

发布于 2024-12-01 03:37:26 字数 463 浏览 0 评论 0原文

我有一种算法,可以一遍又一遍地执行线性代数的树步骤,

loop{
  first I multiply a Vector and a Matrix, 
  Second I calculate the sum of elements in the Vector 
  and Thirdly I scale the vector using the sum, making sure the vectors elements scale to one.
}

我使用 BLAS 来执行操作,这有点快,但它需要对数据进行树运行,每个步骤一次。现在我想知道通过将这些步骤合并为一个(仅运行一次数据)是否可以获得一些好处。

有没有人有关于如何以最佳方式实现这些调用的经验,我的矩阵约为 100*100,向量有 100 个元素。

我认为该向量可以适合 8 个 128 字节 mmx 寄存器。使乘法变得相当快,有什么想法吗?

I've got an algorithm that does tree steps of linear algebra over and over again,

loop{
  first I multiply a Vector and a Matrix, 
  Second I calculate the sum of elements in the Vector 
  and Thirdly I scale the vector using the sum, making sure the vectors elements scale to one.
}

I'm using BLAS to do the operations, and this is somewhat quick, but it takes tree runs over the data, one for each step. Now I'm wondering if there would be something to gain by combining the steps into one, there by only running over the data once.

Do anyone have any exprience on how to implement these calls in an optimal way, my matrix is about 100*100 and the vector is has 100 elements.

I'm thinking that the vector can fit into something like 8 128byte mmx registers. making the multiplications pretty fast, any ideas?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

梦在深巷 2024-12-08 03:37:26

优化的 BLAS 库是非常棘手的代码,除非您是 asm 编程专家并且了解 CPU 的缓存性能,并且愿意花费大量时间来测试各种方法,否则您不太可能做得更好。如果你想看看它是如何完成的,你可以下载并查看 GOTO BLAS 的源代码(在 asm 中实现,是的)。

我不确定如何对您的代码进行实质性优化。我怀疑已经在 N=100 时,矩阵向量乘积的 O(N^2) 将主导运行时间,并且算法中的第二步和第三步是相当微不足道的。因此,尝试将所有 3 个步骤结合起来看起来并没有那么有用。

我想你可以做的一件小事(除非你已经这样做了)是在第三步中乘以总和的倒数而不是除以总和;除法比乘法昂贵得多。例如


double my_sum = sum(my_vector);
double tmp = 1 / my_sum;
for (i=...) {
   my_vector[i] *= tmp;
}

An optimized BLAS library is very tricky code, and you're unlikely to do better unless you are an expert in asm programming and understand the cache performance of your CPU, and are willing to spend a lot of time on testing various approaches. If you want to see how it's done, you can download and look at the source code for GOTO BLAS (implemented in asm, yes).

I'm not sure how to do any substantial optimization of your code. I suspect that already at N=100, the O(N^2) of the matrix-vector product will dominate the runtime and the second and third steps in your algorithm are rather insignificant. So trying to combine all 3 steps doesn't look that useful.

I suppose one minor thing you can do, unless you're already doing it, is that in the 3rd step multiply by the reciprocal of the sum instead of dividing by the sum; division is a lot more expensive than multiplication. E.g.


double my_sum = sum(my_vector);
double tmp = 1 / my_sum;
for (i=...) {
   my_vector[i] *= tmp;
}

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文