CUDA: reasons for using preprocessor macros to specify the problem size
I'm writing CUDA code in MATLAB MEX files. CUDA examples on the internet, and even NVIDIA's manuals, often use preprocessor macros to specify the problem size, e.g. the vector length for a vector addition. I wrote my program the same way, with preprocessor macros specifying the problem size. I have to admit I like it, since the values are accessible everywhere in the code, e.g. as loop limits, without having to pass them explicitly as function arguments.
But I ran into the following problem: I wanted to benchmark the program for several different problem sizes, so I have to recompile the code each time, passing the macro value to the compiler. That's not a real problem; I've already written the benchmark and it works. But in hindsight I wonder why I chose this approach instead of simply specifying the size through user input at run time. So I'm looking for reasons why one might want to use preprocessor macros instead of simply passing the problem size to the program.
Thanks!
Comments (2)
When you compile problem-size constants into the kernel, the compiler can make certain classes of optimizations that it can't make if the sizes are only known at run time. Full loop unrolling is an obvious example.
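As a rough illustration of the unrolling point, here is a minimal sketch that puts a compile-time trip count next to a run-time one. The macro `TAPS`, the kernel names and the host helper `kernels_agree` are made up for this example and are not from the question; whether the compiler actually unrolls the fixed loop depends on the toolchain and flags.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical macro name; override at build time, e.g. `nvcc -DTAPS=8 file.cu`.
#ifndef TAPS
#define TAPS 4
#endif

// Trip count is a compile-time constant, so the compiler may unroll the loop fully.
__global__ void sum_fixed(const float* in, float* out, int n_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_out) return;
    float acc = 0.0f;
#pragma unroll
    for (int k = 0; k < TAPS; ++k)
        acc += in[i * TAPS + k];
    out[i] = acc;
}

// Same work, but the trip count arrives as a run-time argument.
__global__ void sum_runtime(const float* in, float* out, int n_out, int taps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_out) return;
    float acc = 0.0f;
    for (int k = 0; k < taps; ++k)
        acc += in[i * taps + k];
    out[i] = acc;
}

// Illustrative host helper: runs both kernels and checks them against a CPU sum.
bool kernels_agree(int n_out)
{
    std::vector<float> h_in(static_cast<size_t>(n_out) * TAPS);
    for (size_t j = 0; j < h_in.size(); ++j) h_in[j] = static_cast<float>(j % 7);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    size_t in_bytes = h_in.size() * sizeof(float);
    size_t out_bytes = static_cast<size_t>(n_out) * sizeof(float);
    bool ok = cudaMalloc(&d_in, in_bytes) == cudaSuccess
           && cudaMalloc(&d_a, out_bytes) == cudaSuccess
           && cudaMalloc(&d_b, out_bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), in_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128, blocks = (n_out + threads - 1) / threads;
        sum_fixed<<<blocks, threads>>>(d_in, d_a, n_out);
        sum_runtime<<<blocks, threads>>>(d_in, d_b, n_out, TAPS);
        std::vector<float> a(n_out), b(n_out);
        ok = cudaMemcpy(a.data(), d_a, out_bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(b.data(), d_b, out_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n_out; ++i) {
            float ref = 0.0f;
            for (int k = 0; k < TAPS; ++k) ref += h_in[static_cast<size_t>(i) * TAPS + k];
            ok = (a[i] == ref) && (b[i] == ref);
        }
    }
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return ok;
}
```

Both kernels compute the same result; the difference is only in what the compiler knows. Comparing the generated code for the two (for instance with `cuobjdump -sass`) is one way to see whether the fixed version was actually unrolled.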
In other cases, for instance shared memory array sizes, the code is a lot clearer if the sizes are compiled in. Otherwise you have to pass the total shared memory size at kernel launch time and split that memory up into the shared arrays you need. That works fine, but the code is much clearer if you can just use static declarations, which require compile-time sizes.
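To make the shared memory comparison concrete, here is a small sketch of both styles. The macro `TILE`, the kernels and the helper `shared_variants_agree` are hypothetical names chosen for this example, and the kernels assume the array length is an exact multiple of the block size so that no bounds checks are needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical compile-time block size; override with e.g. `nvcc -DTILE=256 file.cu`.
#ifndef TILE
#define TILE 128
#endif

// Static variant: both shared arrays are declared directly with compile-time sizes.
__global__ void mix_static(const float* x, const float* y, float* out)
{
    __shared__ float sx[TILE];
    __shared__ float sy[TILE];
    int t = threadIdx.x, i = blockIdx.x * TILE + t;
    sx[t] = x[i];
    sy[t] = y[i];
    __syncthreads();
    out[i] = sx[t] * sy[TILE - 1 - t];   // reads a value loaded by another thread
}

// Dynamic variant: one extern buffer, sized at launch and carved up by hand.
__global__ void mix_dynamic(const float* x, const float* y, float* out)
{
    extern __shared__ float buf[];
    float* sx = buf;                 // first blockDim.x floats
    float* sy = buf + blockDim.x;    // next blockDim.x floats
    int t = threadIdx.x, i = blockIdx.x * blockDim.x + t;
    sx[t] = x[i];
    sy[t] = y[i];
    __syncthreads();
    out[i] = sx[t] * sy[blockDim.x - 1 - t];
}

// Illustrative host helper: runs both variants and checks them against the CPU.
bool shared_variants_agree(int blocks)
{
    int n = blocks * TILE;           // assumption: length is a multiple of TILE
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> hx(n), hy(n), a(n), b(n);
    for (int i = 0; i < n; ++i) { hx[i] = float(i % 5); hy[i] = float(i % 3); }

    float *dx = nullptr, *dy = nullptr, *da = nullptr, *db = nullptr;
    bool ok = cudaMalloc(&dx, bytes) == cudaSuccess
           && cudaMalloc(&dy, bytes) == cudaSuccess
           && cudaMalloc(&da, bytes) == cudaSuccess
           && cudaMalloc(&db, bytes) == cudaSuccess
           && cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        mix_static<<<blocks, TILE>>>(dx, dy, da);
        // The third launch parameter is the total dynamic shared memory in bytes.
        mix_dynamic<<<blocks, TILE, 2 * TILE * sizeof(float)>>>(dx, dy, db);
        ok = cudaMemcpy(a.data(), da, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(b.data(), db, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) {
            int base = (i / TILE) * TILE, t = i % TILE;
            float ref = hx[i] * hy[base + TILE - 1 - t];
            ok = (a[i] == ref) && (b[i] == ref);
        }
    }
    cudaFree(dx); cudaFree(dy); cudaFree(da); cudaFree(db);
    return ok;
}
```

The dynamic version is the one that lets the size vary at run time, at the cost of the pointer arithmetic and the extra launch parameter; with more arrays or mixed element types, the manual partitioning also has to take alignment into account.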
The main reason is that, in general, the problem size is intimately linked to the GPU architecture, e.g. the number of threads per block, the number of blocks, the amount of shared memory per block, the number of registers per thread, etc. These numbers are usually carefully hand-tuned to make maximum use of the available resources, and you can't easily change the problem size dynamically while still maintaining optimum performance.
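One way to see this coupling is to fix the block size at compile time and ask the runtime how many such blocks fit on one multiprocessor. The sketch below is only an illustration under assumptions: `BLOCK_SIZE`, the `scale` kernel and the helper `resident_blocks_per_sm` are hypothetical names, and the number returned depends on the GPU and on how many registers the compiler assigns to the kernel.

```cuda
#include <cuda_runtime.h>

// Hypothetical tuned value; override with e.g. `nvcc -DBLOCK_SIZE=128 file.cu`.
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 256
#endif

// __launch_bounds__ tells the compiler the kernel is never launched with more
// than BLOCK_SIZE threads per block, which it can use when budgeting registers.
__global__ void __launch_bounds__(BLOCK_SIZE)
scale(float* data, float factor, int n)
{
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Illustrative helper: how many blocks of BLOCK_SIZE threads can be resident on
// one multiprocessor for this kernel. Returns -1 if the query fails.
int resident_blocks_per_sm()
{
    int blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, scale, BLOCK_SIZE, 0 /* dynamic shared memory in bytes */);
    return err == cudaSuccess ? blocks : -1;
}
```

Rebuilding with a different `BLOCK_SIZE` and calling `resident_blocks_per_sm()` again shows how the compile-time choice and the hardware limits interact, which is the kind of tuning the answer refers to.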