交叉点:施特拉森算法
就效率而言,施特拉森算法应停止递归并应用乘法的最佳交叉点是什么?
我知道这与具体的实现和硬件密切相关,但应该有某种指导方针或某人针对一般情况的一些实验结果。
在网上搜索一下并询问一些他们倾向于认为是
n = 64;
或
n = 32;
任何人都可以验证/拒绝这些结果的人吗?
What is the optimal crossover point under which Strassen's Algorithm should stop the recursion and apply the multiplication, in terms of efficiency?
I know this is closely connected to the specific implementation and hardware but there should be some kind of guideline or some experimental results from someone for a general case.
Searching a bit online and asking some people they tend to think it's
n = 64;
or
n = 32;
Can anyone verify / reject these results?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这应该在每台机器的基础上进行调整(有点像 ATLAS 那样)。这种优化对于相当大的矩阵来说是值得的:如果你自己编码,并将其与例如进行比较。供应商 BLAS 实现,那么你会发现一个相当大的 n。
Strassen 算法的内存要求也是需要权衡的因素。
This should be tuned on a per machine basis (a bit like ATLAS does). This kind of optimizations pay off for quite large matrices: if you code it yourself, and compare it to eg. a vendor BLAS implementation, then you would find a quite large n.
Memory requirements for Strassen algorithm are also something to weigh.
在我的双核 2.66 Mac Pro 上使用
[我的实现][1]
交叉小至 n = 16。
事实上,我的实现比大规模的传统算法快方式
矩阵——我不知道为什么——我认为它对缓存更友好,因为它
专注于较小的本地化数据。事实上,我正要发布一个关于这个的问题。
http://ezekiel.vancouver.wsu.edu/~cs330 /lectures/linear_algebra/mm/mm.c
On my dual core 2.66 Mac Pro using
[my implementation][1]
the cross-over is as small as n = 16.
In fact, my implementation is way faster than the conventional algorithm for large
matrices -- I am not sure why -- I think it is more cache-friendly since it
focuses on smaller localized data. In fact, I am about to post a question about this.
http://ezekiel.vancouver.wsu.edu/~cs330/lectures/linear_algebra/mm/mm.c
经过大量测试后,我得出的结论是,至少对于我的处理器来说,施特拉森算法的最佳交叉点是
n = 128
。我的处理器是:英特尔酷睿 i5-430M。另外,有趣的是,对于 4 线程 CPU,我的实现在
numberOfProcesses = 8
上比numberOfProcesses = 4
工作得更好一些。我不知道这是如何或为何发生的。我猜想,由于通过渠道进行更多的通信,它会产生更大的开销,而且由于它们不能同时工作,所以速度肯定会慢一些。显然我错了。顺便说一句,如果有人可以解释这一点,请留言,仅供记录。谢谢。
After a lot of tests, I have concluded that at least for my processor the optimal crossover point under which Strassen's Algorithm is
n = 128
.My processor is: Intel Core i5-430M. Also, interestingly enough for a 4-threaded CPU, my implementation worked a little bit better for
numberOfProcesses = 8
thannumberOfProcesses = 4
. I have no clue how or why that happened. I'd guess that due to more communication through channels it would have a bigger overhead and since they cannot all work at the same time it would definitely be a bit slower. Apparently I was wrong. If anyone can explain this btw please drop a line, just for the record.Thanks.