Launching multiple kernels concurrently with CUDA on the GPU
Is it possible to launch two kernels that do independent tasks simultaneously? For example, if I have this CUDA code:
// host and device initialization
.......
.......
// launch kernel1
myMethod1<<<....>>>(params);
// launch kernel2
myMethod2<<<....>>>(params);
Assuming that these kernels are independent, is there a facility to launch them at the same time, allocating a few grids/blocks to each? Does CUDA/OpenCL have this provision?
3 Answers
Only devices with CUDA compute capability 2.0 and better (i.e. Fermi) can support multiple simultaneous kernel executions. See section 3.2.6.3 of the CUDA 3.0 programming guide, which states:
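You can check this at runtime rather than relying on the device's marketing name. A minimal sketch using the CUDA runtime API, which reports the compute capability and the `concurrentKernels` flag of `cudaDeviceProp` (device 0 is assumed here):

```cuda
// Sketch: query device 0 for concurrent-kernel support.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Concurrent kernels supported: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```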
You will need SM 2.0 or above for concurrent kernels.
To get concurrent execution you need to manually indicate that there is no dependence between the two kernels. This is because the compiler cannot determine that one kernel will not modify data being used by the other. Reading from and writing to the same buffer seems simple enough to rule out, but it is actually much harder to detect, since there can be pointers inside data structures and so on.
To express the independence you must launch the kernels in different streams. The fourth parameter in the triple-chevron syntax specifies the stream, check out the Programming Guide or the SDK concurrentKernels sample.
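The stream-based launch described above can be sketched as follows. This is a minimal example, not the SDK sample itself; `myMethod1`/`myMethod2` stand in for the question's kernels, and the grid/block sizes are arbitrary placeholders:

```cuda
// Sketch: launching two independent kernels in separate streams so they
// may overlap on a compute-capability-2.0+ device.
#include <cuda_runtime.h>

__global__ void myMethod1(int *data) { /* independent work */ }
__global__ void myMethod2(int *data) { /* independent work */ }

int main() {
    int *d_a, *d_b;
    cudaMalloc(&d_a, 1024 * sizeof(int));
    cudaMalloc(&d_b, 1024 * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The fourth launch parameter selects the stream; kernels issued to
    // different non-default streams have no implied ordering between them,
    // so the hardware is free to run them concurrently.
    myMethod1<<<4, 256, 0, s1>>>(d_a);
    myMethod2<<<4, 256, 0, s2>>>(d_b);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

Note that launching in separate streams only makes concurrency *possible*; whether the kernels actually overlap depends on the device having free resources (SMs, registers, shared memory) after scheduling the first kernel.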
CUDA compute capability 2.1 = up to 16 concurrent kernels