CUDA convolution - non-separable kernels
I need to implement an efficient image convolution with non-separable kernels. The CUDA SDK is therefore only useful to me for its FFT example, and that example clearly states that it works well only for large kernel sizes.
Apart from implementing it from scratch, which is the first thing that comes to mind, I need to operate on matrices and kernels whose sizes are not known in advance (they could be anything from 10x10 to 20,000x20,000; I simply can't predict it).
What are your suggestions regarding the FFT example? (If this is your preferred option, please give me some good starting points for understanding how it works.)
And for the second option (implementing the convolution manually), what would you suggest to maximize memory coalescing?
Comments (2)
My suggestions for working with the GPU:
First, make it right. Get comfortable with the algorithm you want to run on the GPU by implementing it on the CPU first. On the GPU you will have to deal with many more low-level details, so it is important to know what the output must be.
Then, make it fast. The FFT approach is the fastest one if you can use it (which is true in most cases).
To reach the first objective, I advise you to try implementing it with OpenCV. It has a very nice Python wrapper and provides a framework for filtering.
Once you are sure of your result and of how to achieve it with OpenCV, test whether you can do the same using FFT. Porting the whole thing to the GPU will then be much easier.
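As a rough illustration of the "make it right on the CPU first" step, here is a minimal sketch of a direct 2D convolution that could serve as a ground truth when checking an FFT or GPU version. The `Image` struct and the `convolve2d_reference` function are hypothetical names introduced for this sketch, and it assumes zero padding at the borders and an output of the same size as the input; neither choice comes from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type for this sketch: a row-major 2D float buffer.
struct Image {
    std::size_t rows, cols;
    std::vector<float> data;
    Image(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Direct (non-FFT) convolution, O(rows * cols * krows * kcols).
// Assumptions: zero padding outside the image, output has the input's size,
// and the kernel anchor is its centre element.
Image convolve2d_reference(const Image& in, const Image& k) {
    assert(k.rows > 0 && k.cols > 0);
    Image out(in.rows, in.cols);
    const long rows = static_cast<long>(in.rows);
    const long cols = static_cast<long>(in.cols);
    const long krows = static_cast<long>(k.rows);
    const long kcols = static_cast<long>(k.cols);
    const long cy = krows / 2, cx = kcols / 2;

    for (long y = 0; y < rows; ++y) {
        for (long x = 0; x < cols; ++x) {
            double acc = 0.0;  // accumulate in double to limit rounding error
            for (long ky = 0; ky < krows; ++ky) {
                for (long kx = 0; kx < kcols; ++kx) {
                    // True convolution flips the kernel relative to the image.
                    const long iy = y + cy - ky;
                    const long ix = x + cx - kx;
                    if (iy < 0 || iy >= rows || ix < 0 || ix >= cols) continue;
                    acc += static_cast<double>(k.at(ky, kx)) * in.at(iy, ix);
                }
            }
            out.at(y, x) = static_cast<float>(acc);
        }
    }
    return out;
}
```

One thing to keep in mind when comparing against OpenCV: its `filter2D` computes a correlation rather than a flipped-kernel convolution, so the kernel has to be flipped on one side before the two results can be expected to match for asymmetric kernels.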
You might want to look at the implementation of convolution in Theano (it uses non-FFT-based kernels), or just use Theano.