How many processors can I get for one block on a CUDA GPU?
I have three questions:
- If I create only one block of threads in CUDA and execute a parallel program on it, is it possible that more than one processor would be assigned to that single block, so that my program gets some benefit from the multiprocessor platform? To be more specific: if I use only one block of threads, how many processors will be allocated to it? As far as I know (I may have misunderstood this), one warp is given only a single processing element.
- Can I synchronize the threads of different blocks? If so, please give some hints on how to do it.
- How do I find out the warp size? Is it fixed for a particular piece of hardware?
Simple answer: No.
The CUDA programming model maps one threadblock to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not move from one multiprocessor to another.
As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low-cost operation, partly because all the threads within a block are in close proximity (on the same SM). If they were allowed to split, this would no longer be possible. In addition, threads within a block can cooperate by sharing data in shared memory; shared memory is local to an SM, so splitting a block would break this too.

Not really, no. There are some things you can do, such as having the very last block do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of blocks onto the multiprocessors, so any attempt at global synchronisation risks deadlock.
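To illustrate why this matters, here is a minimal sketch (not from the original answer; the kernel and variable names are ours) of a block-local reduction. It works precisely because all threads of the block share one SM's shared memory and can meet at the `__syncthreads()` barrier; it assumes the block size is a power of two:

```cuda
// Sketch: per-block sum via shared memory and __syncthreads().
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float *in, float *out)
{
    extern __shared__ float s[];               // shared memory, local to the SM
    unsigned t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                           // every thread of the block waits here

    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();                       // reached by all threads, so it is safe
    }
    if (t == 0)
        out[blockIdx.x] = s[0];                // thread 0 writes the block's result
}
```

Note that every thread executes each `__syncthreads()` (only the addition is guarded by the `if`); a barrier inside divergent control flow would hang.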
Yes, it is fixed. In fact, for all current CUDA capable devices (both 1.x and 2.0) it is fixed to 32. If you are relying on the warp size then you should ensure forward-compatibility by checking the warp size.
In device code you can just use the special variable warpSize. In host code you can query the warp size for a specific device with:
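The code snippet that followed "with:" did not survive extraction. A plausible reconstruction (our sketch, using the standard runtime call `cudaGetDeviceProperties()`; the original may have differed) would be:

```cuda
// Sketch: query the warp size of device 0 from host code.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // fill in the properties of device 0
    printf("warp size: %d\n", prop.warpSize);
    return 0;
}
```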
不是真的,但是......(根据您的要求,您可能会找到解决方法)
阅读这篇文章 CUDA:同步线程
As of CUDA 2.3, one processor per thread block. It might be different in CUDA 3/Fermi processors; I do not remember.

Not really, but... (depending on your requirements you may find a workaround).

Read this post: CUDA: synchronizing threads
#3. You can query SIMDWidth using cuDeviceGetProperties - see the documentation.
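A short driver-API sketch of that query (ours, not the answerer's; note that `cuDeviceGetProperties` is deprecated in recent toolkits in favour of `cuDeviceGetAttribute`):

```cuda
// Sketch: read SIMDWidth (the warp size) via the CUDA driver API.
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUdevprop prop;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetProperties(&prop, dev);   // fills prop.SIMDWidth among other fields
    printf("SIMDWidth (warp size): %d\n", prop.SIMDWidth);
    return 0;
}
```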
To synchronize threads across multiple blocks (at least as far as memory updates are concerned), you can use the new __threadfence_system() call, which is only available on Fermi devices (Compute Capability 2.0 and better). This function is described in the CUDA Programming Guide for CUDA 3.0.
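The usual pattern is worth spelling out, because `__threadfence_system()` is a memory fence, not a barrier: it orders writes, so a flag written after the fence signals that the data write is already visible. A hedged sketch (our names, not from the answer):

```cuda
// Sketch: publish a result to other blocks/the host with a fence + flag.
__device__ volatile int ready = 0;   // consumers poll this flag
__device__ int result;

__global__ void producer(int value)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        result = value;              // 1. write the data
        __threadfence_system();      // 2. make the write visible system-wide
        ready = 1;                   // 3. only then publish the flag
    }
}
```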
Can I synchronize threads of different blocks with the following approach? Please tell me if there is any problem with this approach (I think there will be some, but since I am not very experienced with CUDA, I may not have considered some facts):

execute the code which is to be executed at the same time by all threads;
}