Does replacing int with short help performance in CUDA?
Assume that we have enough global memory. Does replacing int with short improve performance in CUDA? (For example, does short reduce the usage of shared memory, registers, etc.?)
Advice is welcome. Thanks.
Comments (3)
Using short in shared memory will most likely reduce performance due to bank conflicts, unless you use short2. Also, as far as I know, all registers on the GPU are 32-bit, so it's unlikely that using short would reduce register usage.
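To illustrate the short2 suggestion, here is a minimal sketch, not a benchmark. The kernel name sumPairs, the tile size of 256 and the test data are all assumptions made up for this example. The idea is that each thread stages one short2 in shared memory, which is typically a single 32-bit access per thread rather than two threads sharing one 32-bit word.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stages one short2 (4 bytes) in shared
// memory, then sums its two 16-bit halves using int arithmetic.
__global__ void sumPairs(const short2* in, int* out, int n)
{
    __shared__ short2 tile[256];                 // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];        // one short2 store per thread
    __syncthreads();
    if (i < n) {
        short2 v = tile[threadIdx.x];            // one short2 load per thread
        out[i] = int(v.x) + int(v.y);            // widen before adding
    }
}

int main()
{
    const int n = 1024;
    short2 hIn[n];
    int hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = make_short2((short)i, (short)(2 * i));

    short2* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(short2));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, hIn, n * sizeof(short2), cudaMemcpyHostToDevice);

    sumPairs<<<n / 256, 256>>>(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Each element should equal i + 2*i.
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hOut[i] == 3 * i);
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

Whether this is actually faster than a plain int tile depends on the architecture, so it is worth profiling both variants on the target GPU.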
It depends:
If your program is memory bound, then yes: transferring the input as shorts could be beneficial.
If your kernel is compute bound, the answer is more likely no, because the kernel has to perform extra operations to convert from short to int and then back to short each time.
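The trade-off described above can be sketched as follows. This is only an illustration: the kernel name scaleShort, the factor parameter and the test data are hypothetical. The data moves through global memory as 16-bit values (half the traffic of int), while the arithmetic is done in int, so each element pays for a widening and a narrowing conversion.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: 16-bit loads and stores, 32-bit arithmetic.
__global__ void scaleShort(const short* in, short* out, int n, int factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = in[i];            // widen short -> int
        v = v * factor + 1;       // compute in a 32-bit register
        out[i] = (short)v;        // narrow int -> short before storing
    }
}

int main()
{
    const int n = 1024;
    const int factor = 3;         // small enough that results fit in a short
    short hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = (short)i;

    short* dIn = nullptr;
    short* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(short));
    cudaMalloc(&dOut, n * sizeof(short));
    cudaMemcpy(dIn, hIn, n * sizeof(short), cudaMemcpyHostToDevice);

    scaleShort<<<(n + 255) / 256, 256>>>(dIn, dOut, n, factor);
    cudaMemcpy(hOut, dOut, n * sizeof(short), cudaMemcpyDeviceToHost);

    // Each element should equal i * factor + 1.
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hOut[i] == (short)(i * factor + 1));
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

In a kernel this simple the memory traffic dominates, so the conversions are unlikely to matter; in an arithmetic-heavy kernel they could.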
Tesla-class hardware (SM 1.x) has surprisingly rich support for "half registers," so you might get some mileage from using short instead of int on those platforms. You can confirm by using cuobjdump to look at the microcode in the cubin. But Fermi removed that support.
With SM 2.1, NVIDIA added support for "video" instructions that implement 32-bit-wide SIMD operations on 32-bit registers; see section 8.7.9 of the PTX 2.1 spec.
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/ptx_isa_2.1.pdf
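For a feel of what SIMD on a 32-bit register looks like from CUDA C, here is a small sketch using the __vadd2 SIMD intrinsic, which adds the two 16-bit halves of its operands independently with wrap-around. Using this intrinsic as a stand-in for the PTX video instructions mentioned above is an assumption on my part, as are the kernel name addPacked and the test values. Depending on the architecture, the compiler may emulate the intrinsic with several ordinary instructions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each 32-bit word holds two packed 16-bit values,
// and __vadd2 adds the halves independently (no carry between them).
__global__ void addPacked(const unsigned int* a, const unsigned int* b,
                          unsigned int* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = __vadd2(a[i], b[i]);
}

int main()
{
    const int n = 256;
    unsigned int hA[n], hB[n], hC[n];
    for (int i = 0; i < n; ++i) {
        hA[i] = ((unsigned int)i << 16) | 0xFFFFu;  // high half i, low half 0xFFFF
        hB[i] = (1u << 16) | 1u;                    // high half 1, low half 1
    }

    unsigned int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    addPacked<<<1, n>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);

    // Low half wraps from 0xFFFF + 1 to 0 without carrying into the high
    // half, so the expected word is (i + 1) in the high half and 0 below.
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hC[i] == ((unsigned int)(i + 1) << 16));
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok ? 0 : 1;
}
```

To see which instructions were actually generated for a given architecture, the compiled code can be inspected with cuobjdump, as suggested above.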