How many processors can I get in one block on a CUDA GPU?

Posted 2024-09-02 13:22:22

I have three questions to ask:

  1. If I create only one block of threads in CUDA and execute the parallel program on it, is it possible that more than one processor would be given to that single block, so that my program gets some benefit from a multiprocessor platform? To be more clear: if I use only one block of threads, how many processors will be allocated to it? As far as I know (I might have misunderstood it), one warp is given only a single processing element.
  2. Can I synchronize the threads of different blocks? If yes, please give some hints on how to do it.
  3. How do I find out the warp size? Is it fixed for a particular piece of hardware?

Comments (5)

原来分手还会想你 2024-09-09 13:22:22

1. Is it possible that more than one processor would be given to a single block so that my program gets some benefit from a multiprocessor platform?

Simple answer: No.

The CUDA programming model maps one thread block to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not migrate from one multiprocessor to another.

As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low-cost operation, partly because all the threads within a block are in close proximity (on the same SM). If blocks were allowed to split, this would no longer be possible. In addition, threads within a block can cooperate by sharing data in shared memory; shared memory is local to an SM, so splitting a block would break this too.
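
To illustrate, here is a minimal sketch (the kernel name and tile size are hypothetical) of the kind of intra-block cooperation that relies on all threads sharing one SM:

#include <cuda_runtime.h>

// Each block reverses one 256-element tile through shared memory.
// Launch with blockDim.x == 256; __syncthreads() is cheap precisely
// because every thread of the block lives on the same SM.
__global__ void reverseTile(int *data)
{
    __shared__ int tile[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[idx];
    __syncthreads();  // wait until the whole tile has been loaded
    data[idx] = tile[blockDim.x - 1 - threadIdx.x];
}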

2. Can I synchronize the threads of different blocks?

Not really, no. There are some things you can do, like having the very last block do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of the blocks onto the multiprocessors, so any attempt at global synchronisation would risk deadlock.
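
A minimal sketch of that last-block trick, loosely modelled on the threadFenceReduction sample (the counter and kernel names here are illustrative, not the sample's own):

#include <cuda_runtime.h>

__device__ unsigned int blocksDone = 0;  // must be reset to 0 before each launch

__global__ void finishLast(const float *partials, float *result)
{
    // ...each block computes and stores its partial result in partials[blockIdx.x]...

    __threadfence();  // make this block's global writes visible to all other blocks

    __shared__ bool amLast;
    if (threadIdx.x == 0) {
        // atomicInc wraps at gridDim.x, so the block drawing ticket
        // gridDim.x - 1 knows every other block has already passed the fence.
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        amLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) {
        float sum = 0.0f;  // only the final block combines the partial results
        for (unsigned int i = 0; i < gridDim.x; ++i)
            sum += partials[i];
        *result = sum;
    }
}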

3. How do I find out the warp size? Is it fixed for a particular piece of hardware?

Yes, it is fixed. In fact, for all current CUDA-capable devices (both 1.x and 2.0) it is fixed at 32. If you are relying on the warp size, you should ensure forward compatibility by checking it at runtime rather than hard-coding it.

In device code you can just use the special variable warpSize. In host code you can query the warp size for a specific device with:

#include <stdio.h>
#include <cuda_runtime.h>

int queryWarpSize(void)
{
    cudaError_t result;
    int deviceID;
    struct cudaDeviceProp prop;

    result = cudaGetDevice(&deviceID);
    if (result != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDevice: %s\n", cudaGetErrorString(result));
        return -1;
    }
    result = cudaGetDeviceProperties(&prop, deviceID);
    if (result != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(result));
        return -1;
    }

    int warpSize = prop.warpSize;  // 32 on all current CUDA devices
    return warpSize;
}
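
On the device side, warpSize is a built-in variable; a tiny (hypothetical) kernel using it to compute each thread's lane index:

__global__ void laneIndex(int *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = threadIdx.x % warpSize;  // warpSize is provided by the compiler in device code
}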
浴红衣 2024-09-09 13:22:22
  1. As of CUDA 2.3, one processor per thread block. It might be different in CUDA 3/Fermi processors; I do not remember.

  2. Not really, but... (depending on your requirements you may find a workaround)
    Read this post: CUDA: synchronizing threads

时光无声 2024-09-09 13:22:22

#3. You can query SIMDWidth using cuDeviceGetProperties; see the documentation.
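
A minimal driver-API sketch of that query, assuming the long-standing cuDeviceGetProperties entry point and its CUdevprop struct (link with -lcuda):

#include <stdio.h>
#include <cuda.h>  // CUDA driver API

int main(void)
{
    CUdevice dev;
    CUdevprop prop;

    if (cuInit(0) != CUDA_SUCCESS) return 1;                         // initialise the driver API
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) return 1;              // first CUDA device
    if (cuDeviceGetProperties(&prop, dev) != CUDA_SUCCESS) return 1;

    printf("SIMDWidth (warp size): %d\n", prop.SIMDWidth);
    return 0;
}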

茶底世界 2024-09-09 13:22:22

To synchronize threads across multiple blocks (at least as far as memory updates are concerned), you can use the new __threadfence_system() call, which is only available on Fermi devices (compute capability 2.0 and better). This function is described in the CUDA Programming Guide for CUDA 3.0.
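
As a rough sketch of how such a fence orders updates between blocks (the flag and payload names are illustrative; note this orders memory visibility rather than acting as a true barrier, and the spin assumes both blocks are resident at once):

__device__ volatile int ready = 0;
__device__ int payload;

__global__ void publishAndConsume(void)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = 42;            // write the data first
        __threadfence_system();  // make the write visible device- and system-wide
        ready = 1;               // only then publish the flag
    }
    else if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (ready == 0) { }   // spin until block 0 publishes
        int v = payload;         // sees 42 thanks to the fence ordering
        (void)v;
    }
}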

格子衫的從容 2024-09-09 13:22:22

Can I synchronize threads of different blocks with the following approach? Please do tell me if there is any problem with this approach (I think there will be some, but since I'm not very experienced in CUDA I might not have considered some facts).

__global__ void sync_func(int *glob_var)
{
    int i = 0;  // local to each thread
    // the original wrote blockDim.x * threadDim.x; threadDim does not exist,
    // so the total thread count of the grid is presumably what was meant:
    int total_threads = gridDim.x * blockDim.x;
    while (*glob_var != total_threads)
    {
        if (i == 0)
        {
            atomicAdd(glob_var, 1);  // the original atomicAdd(int *glob_var, 1) was a syntax error
            i = 1;
        }
    }

    // execute the code which is to be executed at the same time by all threads;
}
