CUDA: how do I create an array in shared memory inside a kernel at runtime?
I have a task in which a large number of threads run, each performing a small matrix multiplication. All the small matrices have been loaded into global memory. I want to improve performance by having each thread load its small matrices into shared memory and then compute the product. The problem is that I do not know the sizes of the matrices at compile time, so I cannot declare variables as in __shared__ double mat1[XSIZE][YSIZE]. On a PC I would use dynamic allocation, but I do not know whether that is possible for shared memory. If calling malloc in a kernel only allocates in global memory (assuming such a call is even possible), that does not help either.

Is there a way to declare arrays at runtime inside a kernel? Is there any other way to solve this problem?
1 Answer
You can declare a dynamically sized shared memory allocation in CUDA using the extern __shared__ idiom. A minimal sketch follows; the kernel name and its parameters are illustrative placeholders, not from the original post:
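```cuda
__global__ void mat_mul_kernel(const double *a, const double *b, double *c,
                               int xsize, int ysize)
{
    // One dynamically sized shared memory buffer per block; its size in
    // bytes is supplied as the third <<<...>>> launch argument.
    extern __shared__ double smem[];

    // A kernel may declare only one extern __shared__ array, so carve it
    // into two tiles by offset.
    double *mat1 = smem;                  // first xsize * ysize doubles
    double *mat2 = smem + xsize * ysize;  // next xsize * ysize doubles

    // ... load mat1/mat2 from global memory, __syncthreads(),
    //     then compute the product into c ...
}
```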
And then launch your kernel, passing the number of bytes of dynamic shared memory per block as the third launch-configuration argument:
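```cuda
// Grid/block dimensions and the device pointers here are placeholders.
size_t shmem_bytes = 2 * xsize * ysize * sizeof(double);
mat_mul_kernel<<<num_blocks, threads_per_block, shmem_bytes>>>(d_a, d_b, d_c,
                                                               xsize, ysize);
```

Because only one extern __shared__ array is allowed per kernel, any logically separate arrays must share that single buffer via offsets, as the declaration sketch above does.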
This is discussed in more detail in the CUDA programming guide.