Performance of static versus dynamic CUDA shared memory allocation
I have 2 kernels that do exactly the same thing. One of them allocates shared memory statically, while the other allocates it dynamically at run time. I am using the shared memory as a 2D array, so for the dynamic allocation I have a macro that computes the memory location. The results generated by the 2 kernels are exactly the same. However, the timings I get from the two kernels differ by a factor of 3: the static allocation is much faster. I am sorry that I can't post any of my code. Can someone explain why this happens?
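Since the question includes no code, the following is only a minimal sketch of the two allocation styles it describes, not the asker's kernels. The tile size `TILE`, the indexing macro `IDX2D`, the kernel names, and the host helper `transposes_match` are all hypothetical names chosen for illustration. Both kernels transpose one small tile through shared memory, once with a static 2D array and once with a dynamic 1D buffer indexed through the macro.

```cuda
#include <cuda_runtime.h>

#define TILE 16                                  // hypothetical tile edge length
#define IDX2D(row, col) ((row) * TILE + (col))   // hypothetical 2D-to-1D indexing macro

// Static allocation: the shared array size is fixed at compile time.
__global__ void transpose_static(const float* in, float* out) {
    __shared__ float tile[TILE][TILE];
    const int r = threadIdx.y, c = threadIdx.x;
    tile[r][c] = in[r * TILE + c];
    __syncthreads();
    out[r * TILE + c] = tile[c][r];
}

// Dynamic allocation: the size comes from the third launch parameter,
// and the 2D layout is emulated with the indexing macro.
__global__ void transpose_dynamic(const float* in, float* out) {
    extern __shared__ float tile_dyn[];
    const int r = threadIdx.y, c = threadIdx.x;
    tile_dyn[IDX2D(r, c)] = in[r * TILE + c];
    __syncthreads();
    out[r * TILE + c] = tile_dyn[IDX2D(c, r)];
}

// Host helper (illustrative): runs both kernels on one tile and returns true
// if both outputs agree and equal the transpose of the input.
bool transposes_match() {
    const int n = TILE * TILE;
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_static[n], h_dynamic[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    transpose_static<<<1, block>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_static, d_out, bytes, cudaMemcpyDeviceToHost);

    // The third launch parameter is the dynamic shared memory size in bytes.
    transpose_dynamic<<<1, block, bytes>>>(d_in, d_out);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_dynamic, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        const float expected = h_in[(i % TILE) * TILE + i / TILE];
        if (h_static[i] != expected || h_dynamic[i] != expected) return false;
    }
    return true;
}
```

In this sketch the macro expands to the same row-major arithmetic the compiler would generate for the static 2D array, so the two kernels need not differ in cost. A macro whose stride is a run-time value, or one missing parentheses around its arguments, could behave differently, but without the asker's code that remains speculation.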
Comments (1)
I have no evidence that static shared memory allocation is faster than dynamic shared memory allocation. As the comments above made clear, it is impossible to answer your question without a reproducer. At least for the code I tested, the timings of the same kernel, run with static and with dynamic shared memory allocation, are exactly the same.
A possible reason is that the disassembled code for the two kernels is exactly the same, and it does not change even when int N = 1000000; is replaced with int N = rand();.
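The answerer's test code is not reproduced here, so the following is a hedged sketch of how such a comparison could be set up, not the original benchmark. The kernel bodies, the block size `BLOCK`, the repetition count, and the helper names `time_ms` and `compare_static_dynamic` are assumptions made for illustration. Timing uses CUDA events, with one untimed warm-up launch before each measurement.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // hypothetical block size

// Static shared memory version: stage a block of input, then add each
// element to its right-hand neighbour within the block (wrapping around).
__global__ void k_static(const float* in, float* out, int n) {
    __shared__ float s[BLOCK];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = s[threadIdx.x] + s[(threadIdx.x + 1) % BLOCK];
}

// Dynamic shared memory version of the same computation.
__global__ void k_dynamic(const float* in, float* out, int n) {
    extern __shared__ float s_dyn[];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_dyn[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = s_dyn[threadIdx.x] + s_dyn[(threadIdx.x + 1) % BLOCK];
}

// Returns the average time in milliseconds of `reps` calls to `launch`.
template <typename Launch>
float time_ms(Launch launch, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();  // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Times both kernels on n elements; returns true if their outputs are identical.
bool compare_static_dynamic(int n, float* ms_static, float* ms_dynamic) {
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 97);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
        cudaMalloc(&d_a, bytes) != cudaSuccess ||
        cudaMalloc(&d_b, bytes) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int grid = (n + BLOCK - 1) / BLOCK;
    *ms_static = time_ms([&] { k_static<<<grid, BLOCK>>>(d_in, d_a, n); }, 100);
    *ms_dynamic = time_ms(
        [&] { k_dynamic<<<grid, BLOCK, BLOCK * sizeof(float)>>>(d_in, d_b, n); }, 100);

    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    const cudaError_t err = cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return err == cudaSuccess && h_a == h_b;
}
```

Calling `compare_static_dynamic(1000000, &a, &b)` from a `main` and printing both times gives a first comparison. To check whether the compiler emitted the same instructions for both kernels, the SASS of the compiled binary can be inspected with `cuobjdump -sass`.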