CUDA - simple matrix addition/sum operation

Published 2024-10-25 20:10:27

This should be very simple, but I could not find an exhaustive answer:

I need to perform A + B = C with matrices, where A and B are two matrices of unknown size (they could be anywhere from 2x2 up to 20,000x20,000).

Should I use CUBLAS with the Sgemm function to do the calculation?

I need the maximum achievable speed, so I thought of the CUBLAS library, which should be well optimized.

Comments (3)

晨敛清荷 2024-11-01 20:10:27

For any sort of technical computing, you should always use optimized libraries when available. Existing libraries, used by hundreds of other people, are going to be better tested and better optimized than anything you do yourself, and the time you don't spend writing (and debugging, and optimizing) that function yourself can be better spent working on the actual high-level problem you want to solve instead of re-discovering things other people have already implemented. This is just basic specialization of labour stuff; focus on the compute problem you want to solve, and let people who spend their days professionally writing GPGPU matrix routines do that for you.

Only when you are sure that existing libraries don't do what you need -- maybe they solve too general a problem, or make certain assumptions that don't hold in your case -- should you roll your own.

I agree with the others that in this particular case, the operation is pretty straightforward and it's feasible to DIY; but if you're going to be doing anything else with those matrices once you're done adding them, you'd be best off using optimized BLAS routines for whatever platform you're on.

一生独一 2024-11-01 20:10:27

What you want to do would be trivial to implement in CUDA and will be bandwidth limited.
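
As a sketch of that do-it-yourself approach, here is a minimal grid-stride addition kernel; the kernel name, launch configuration, and test values are illustrative (not from the answer), and the matrices are assumed to be stored contiguously as flat float arrays:

#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Grid-stride element-wise addition: C[i] = A[i] + B[i].
__global__ void matAdd(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
    {
        C[i] = A[i] + B[i];
    }
}

int main()
{
    const size_t rows = 1000, cols = 1000, n = rows * cols;
    std::vector<float> hA(n, 1.0f), hB(n, 2.0f), hC(n);

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, n * sizeof(float));
    cudaMalloc((void **)&dB, n * sizeof(float));
    cudaMalloc((void **)&dC, n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The grid-stride loop decouples grid size from problem size, so one
    // launch configuration covers anything from 2x2 up to 20,000x20,000.
    matAdd<<<256, 256>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("C[0] = %f\n", hC[0]);  // expect 3.0
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Each element is read twice and written once with no reuse, so throughput is set by memory bandwidth rather than arithmetic, which is what "bandwidth limited" means here.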

梦开始←不甜 2024-11-01 20:10:27

And since CUBLAS 5.0, cublas&lt;t&gt;geam (e.g. cublasSgeam for single precision) can be used for that. It computes the weighted sum of two optionally transposed matrices.
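
A sketch of that approach, assuming m x n single-precision matrices already resident on the device in column-major layout (the helper name addWithGeam is mine, not from the answer):

#include <cublas_v2.h>

// Compute C = A + B for m x n matrices stored column-major on the device
// (column-major is what cuBLAS expects); dA, dB, dC are device pointers.
void addWithGeam(cublasHandle_t handle,
                 const float *dA, const float *dB, float *dC,
                 int m, int n)
{
    const float alpha = 1.0f;  // weight applied to op(A)
    const float beta  = 1.0f;  // weight applied to op(B)
    // cublasSgeam computes C = alpha*op(A) + beta*op(B);
    // CUBLAS_OP_N means neither input is transposed.
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n,
                &alpha, dA, m,
                &beta,  dB, m,
                dC, m);
}

The handle comes from a single cublasCreate() call and can be reused across calls; link with -lcublas.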
