在汇编中实现矩阵向量乘法
我有一种算法,可以一遍又一遍地执行线性代数的树步骤,
loop{
first I multiply a Vector and a Matrix,
Second I calculate the sum of elements in the Vector
and Thirdly I scale the vector using the sum, making sure the vectors elements scale to one.
}
我使用 BLAS 来执行操作,这有点快,但它需要对数据进行树运行,每个步骤一次。现在我想知道通过将这些步骤合并为一个(仅运行一次数据)是否可以获得一些好处。
有没有人有关于如何以最佳方式实现这些调用的经验,我的矩阵约为 100*100,向量有 100 个元素。
我认为该向量可以适合 8 个 128 字节 mmx 寄存器。使乘法变得相当快,有什么想法吗?
I've got an algorithm that does tree steps of linear algebra over and over again,
loop{
first I multiply a Vector and a Matrix,
Second I calculate the sum of elements in the Vector
and Thirdly I scale the vector using the sum, making sure the vectors elements scale to one.
}
I'm using BLAS to do the operations, and this is somewhat quick, but it takes tree runs over the data, one for each step. Now I'm wondering if there would be something to gain by combining the steps into one, there by only running over the data once.
Do anyone have any exprience on how to implement these calls in an optimal way, my matrix is about 100*100 and the vector is has 100 elements.
I'm thinking that the vector can fit into something like 8 128byte mmx registers. making the multiplications pretty fast, any ideas?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
优化的 BLAS 库是非常棘手的代码,除非您是 asm 编程专家并且了解 CPU 的缓存性能,并且愿意花费大量时间来测试各种方法,否则您不太可能做得更好。如果你想看看它是如何完成的,你可以下载并查看 GOTO BLAS 的源代码(在 asm 中实现,是的)。
我不确定如何对您的代码进行实质性优化。我怀疑已经在 N=100 时,矩阵向量乘积的 O(N^2) 将主导运行时间,并且算法中的第二步和第三步是相当微不足道的。因此,尝试将所有 3 个步骤结合起来看起来并没有那么有用。
我想你可以做的一件小事(除非你已经这样做了)是在第三步中乘以总和的倒数而不是除以总和;除法比乘法昂贵得多。例如
An optimized BLAS library is very tricky code, and you're unlikely to do better unless you are an expert in asm programming and understand the cache performance of your CPU, and are willing to spend a lot of time on testing various approaches. If you want to see how it's done, you can download and look at the source code for GOTO BLAS (implemented in asm, yes).
I'm not sure how to do any substantial optimization of your code. I suspect that already at N=100, the O(N^2) of the matrix-vector product will dominate the runtime and the second and third steps in your algorithm are rather insignificant. So trying to combine all 3 steps doesn't look that useful.
I suppose one minor thing you can do, unless you're already doing it, is that in the 3rd step multiply by the reciprocal of the sum instead of dividing by the sum; division is a lot more expensive than multiplication. E.g.