交叉点:施特拉森算法

发布于 2024-10-26 20:10:08 字数 224 浏览 1 评论 0原文

就效率而言,施特拉森算法应停止递归并应用乘法的最佳交叉点是什么?

我知道这与具体的实现和硬件密切相关,但应该有某种指导方针或某人针对一般情况的一些实验结果。

在网上搜索一下并询问一些他们倾向于认为是

n = 64; 

 n = 32;

任何人都可以验证/拒绝这些结果的人吗?

What is the optimal crossover point under which Strassen's Algorithm should stop the recursion and apply the multiplication, in terms of efficiency?

I know this is closely connected to the specific implementation and hardware but there should be some kind of guideline or some experimental results from someone for a general case.

Searching a bit online and asking some people they tend to think it's

n = 64; 

or

 n = 32;

Can anyone verify / reject these results?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

呆° 2024-11-02 20:10:08

这应该在每台机器的基础上进行调整(有点像 ATLAS 那样)。这种优化对于相当大的矩阵来说是值得的:如果你自己编码,并将其与例如进行比较。供应商 BLAS 实现,那么你会发现一个相当大的 n。

Strassen 算法的内存要求也是需要权衡的因素。

This should be tuned on a per machine basis (a bit like ATLAS does). This kind of optimizations pay off for quite large matrices: if you code it yourself, and compare it to eg. a vendor BLAS implementation, then you would find a quite large n.

Memory requirements for Strassen algorithm are also something to weigh.

ˉ厌 2024-11-02 20:10:08

在我的双核 2.66 Mac Pro 上使用
[我的实现][1]
交叉小至 n = 16。
事实上,我的实现比大规模的传统算法快方式
矩阵——我不知道为什么——我认为它对缓存更友好,因为它
专注于较小的本地化数据。事实上,我正要发布一个关于这个的问题。

http://ezekiel.vancouver.wsu.edu/~cs330 /lectures/linear_algebra/mm/mm.c

On my dual core 2.66 Mac Pro using
[my implementation][1]
the cross-over is as small as n = 16.
In fact, my implementation is way faster than the conventional algorithm for large
matrices -- I am not sure why -- I think it is more cache-friendly since it
focuses on smaller localized data. In fact, I am about to post a question about this.

http://ezekiel.vancouver.wsu.edu/~cs330/lectures/linear_algebra/mm/mm.c

梦初启 2024-11-02 20:10:08

经过大量测试后,我得出的结论是,至少对于我的处理器来说,施特拉森算法的最佳交叉点是n = 128

我的处理器是:英特尔酷睿 i5-430M。另外,有趣的是,对于 4 线程 CPU,我的实现在 numberOfProcesses = 8 上比 numberOfProcesses = 4 工作得更好一些。我不知道这是如何或为何发生的。我猜想,由于通过渠道进行更多的通信,它会产生更大的开销,而且由于它们不能同时工作,所以速度肯定会慢一些。显然我错了。顺便说一句,如果有人可以解释这一点,请留言,仅供记录。

谢谢。

After a lot of tests, I have concluded that at least for my processor the optimal crossover point under which Strassen's Algorithm is n = 128.

My processor is: Intel Core i5-430M. Also, interestingly enough for a 4-threaded CPU, my implementation worked a little bit better for numberOfProcesses = 8 than numberOfProcesses = 4. I have no clue how or why that happened. I'd guess that due to more communication through channels it would have a bigger overhead and since they cannot all work at the same time it would definitely be a bit slower. Apparently I was wrong. If anyone can explain this btw please drop a line, just for the record.

Thanks.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文