CUDA control divergence

Posted 2024-09-30 16:49:05

Say I have three shared-memory arrays: a, b, and c. I am not sure whether the following thread arrangement will cause control divergence:

if (threadIdx < 64)
{
    if (threadIdx == 1)
        for (int i = 0; i < N; i++)
            c += a[threadIdx]*a[threadIdx];
    else
        for (int i = 0; i < N; i++)
            c += a[threadIdx]*b[threadIdx];
}

If it does, how badly will it affect performance? Is there an efficient way to handle the problem? Thanks.


跨年 2024-10-07 16:49:05

Depending on the dimensions of your block, the first condition threadIdx.x < 64 (note the .x) may not cause any divergence at all. For example, if you have a block with dimensions (128,1,1), then the first two warps (groups of 32 threads that execute in lock-step) will enter the if block while the last two will bypass it. Since each whole warp goes one way or the other, there is no divergence.
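If you want to check this on your own hardware, here is a minimal sketch that uses the warp vote intrinsic __ballot_sync to see whether a predicate of the form threadIdx.x < limit has mixed values inside any warp. The kernel name mark_mixed_warps, the helper count_mixed_warps, and the limit parameter are my own illustrative names, and the block size of 128 is the one assumed in the paragraph above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Lane 0 of each warp records 1 if the predicate threadIdx.x < limit has
// mixed values inside that warp, i.e. a branch on it would diverge there.
__global__ void mark_mixed_warps(int limit, int *mixed)
{
    const unsigned full = 0xffffffffu;
    unsigned votes = __ballot_sync(full, (int)threadIdx.x < limit);
    if (threadIdx.x % 32 == 0)
        mixed[threadIdx.x / 32] = (votes != 0u && votes != full) ? 1 : 0;
}

// Hypothetical helper: launches one block of 128 threads (4 warps) and
// returns how many warps saw a mixed predicate, or -1 on a CUDA error.
int count_mixed_warps(int limit)
{
    const int warps = 4;
    int h_mixed[warps] = {0, 0, 0, 0};
    int *d_mixed = nullptr;
    if (cudaMalloc(&d_mixed, warps * sizeof(int)) != cudaSuccess) return -1;

    mark_mixed_warps<<<1, warps * 32>>>(limit, d_mixed);
    cudaError_t err = cudaMemcpy(h_mixed, d_mixed, warps * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_mixed);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int w = 0; w < warps; ++w) count += h_mixed[w];
    return count;
}

int main()
{
    // 64 is a multiple of the warp size, 48 is not.
    printf("limit 64: %d mixed warps\n", count_mixed_warps(64));
    printf("limit 48: %d mixed warps\n", count_mixed_warps(48));
    return 0;
}
```

With a limit of 64 no warp should report a mixed vote, while a limit such as 48 splits the second warp, so one warp should be reported.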

A conditional like threadIdx.x == 1 will cause divergence in the one warp that contains thread 1, but the cost is very modest. Indeed, in many cases the compiler will be able to implement the conditional expression with a single instruction. For instance, operations like min, max, and abs are generally implemented with a single instruction and cause no divergence at all. You can read about such instructions in the PTX manual.

In general you should not be overly concerned about modest amounts of control-flow divergence like the above. Where divergence will bite you is in situations like

if (threadIdx.x % 4 == 0)
  // do expensive operation
else if (threadIdx.x % 4 == 1)
  // do expensive operation
else if (threadIdx.x % 4 == 2)
  // do expensive operation
else
  // do expensive operation

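where an "expensive operation" is one that requires tens or hundreds of instructions. Each warp then has to run all four branches one after another with only a quarter of its threads active in each, so the divergence caused by the if statements would reduce efficiency by 75%.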
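To make the 75% figure concrete, here is a rough sketch that counts how many of the four branches each warp would actually have to execute. It compares the interleaved selector threadIdx.x % 4 with a warp-aligned selector (threadIdx.x / 32) % 4. The warp-aligned variant is my own addition for comparison, not something proposed above, and the names count_paths and max_paths_per_warp are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// For each warp, count how many of the 4 branch values occur among its lanes.
// A warp that sees k distinct values has to execute k branches in sequence.
__global__ void count_paths(int warp_aligned, int *paths)
{
    int tid = (int)threadIdx.x;
    int branch = warp_aligned ? (tid / 32) % 4 : tid % 4;

    int taken = 0;
    for (int k = 0; k < 4; ++k)
        if (__ballot_sync(0xffffffffu, branch == k) != 0u)
            ++taken;

    if (tid % 32 == 0)
        paths[tid / 32] = taken;
}

// Hypothetical helper: one block of 128 threads (4 warps); returns the
// largest per-warp path count, or -1 on a CUDA error.
int max_paths_per_warp(bool warp_aligned)
{
    const int warps = 4;
    int h_paths[warps] = {0, 0, 0, 0};
    int *d_paths = nullptr;
    if (cudaMalloc(&d_paths, warps * sizeof(int)) != cudaSuccess) return -1;

    count_paths<<<1, warps * 32>>>(warp_aligned ? 1 : 0, d_paths);
    cudaError_t err = cudaMemcpy(h_paths, d_paths, warps * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_paths);
    if (err != cudaSuccess) return -1;

    int worst = 0;
    for (int w = 0; w < warps; ++w)
        if (h_paths[w] > worst) worst = h_paths[w];
    return worst;
}

int main()
{
    printf("threadIdx.x %% 4:        %d paths per warp\n", max_paths_per_warp(false));
    printf("(threadIdx.x / 32) %% 4: %d paths per warp\n", max_paths_per_warp(true));
    return 0;
}
```

If the four branches cost about the same, a warp that runs all four of them does roughly four times the work of a warp that runs one, which is where the 75% loss comes from. Whether you can assign work per warp like this depends entirely on your algorithm.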

Keep in mind that thread divergence is a much lesser concern than (1) high-level algorithm choices and (2) memory locality/coalescing. Very few CUDA programmers should ever be concerned with the sort of divergence in your examples.


装纯掩盖桑 2024-10-07 16:49:05

If there is more than one thread per block, I would expect divergence in one warp of each block (whichever warp holds thread 1).

But, the difference between your two loops is only in which memory to access, not in instructions. So, I would do this instead...

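if (threadIdx.x < 64)
{
    // element type assumed to be float here; match it to your arrays
    const float *ptr;

    // this conditional might diverge
    if (threadIdx.x == 1)
        ptr = a;
    else
        ptr = b;

    // but this part will not: every thread runs the same loop
    for (int i = 0; i < N; i++)
        c += a[threadIdx.x]*ptr[threadIdx.x];
}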
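For completeness, here is a self-contained sketch of the same idea that you can compile and run. Everything around the loop is assumed for illustration: float elements, a block of 128 threads, c treated as a per-thread accumulator that is written out at the end, and the names accumulate, THREADS, and matches_reference.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128  // assumed block size

__global__ void accumulate(const float *a_in, const float *b_in,
                           float *c_out, int n)
{
    __shared__ float a[THREADS];
    __shared__ float b[THREADS];
    int tid = (int)threadIdx.x;
    a[tid] = a_in[tid];
    b[tid] = b_in[tid];
    __syncthreads();

    float c = 0.0f;  // assumption: one accumulator per thread
    if (tid < 64) {
        // Only the operand differs between the two cases, so pick it first...
        const float *ptr = (tid == 1) ? a : b;
        // ...and let every thread run the same loop.
        for (int i = 0; i < n; ++i)
            c += a[tid] * ptr[tid];
    }
    c_out[tid] = c;
}

// Hypothetical check: runs the kernel and compares against a host loop.
bool matches_reference()
{
    const int n = 4;
    float h_a[THREADS], h_b[THREADS], h_c[THREADS];
    for (int i = 0; i < THREADS; ++i) {
        h_a[i] = (float)(i % 8);  // small integers keep float sums exact
        h_b[i] = 2.0f;
    }

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    const size_t bytes = THREADS * sizeof(float);
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return false; }
    if (cudaMalloc(&d_c, bytes) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return false; }
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    accumulate<<<1, THREADS>>>(d_a, d_b, d_c, n);
    cudaError_t err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < THREADS; ++t) {
        float expected = 0.0f;
        if (t < 64)
            for (int i = 0; i < n; ++i)
                expected += h_a[t] * ((t == 1) ? h_a[t] : h_b[t]);
        if (h_c[t] != expected) return false;
    }
    return true;
}

int main()
{
    printf("%s\n", matches_reference() ? "match" : "mismatch");
    return 0;
}
```

A pointer choice this small is something the compiler can typically turn into a select rather than a branch, but I have not checked the generated code for any particular architecture, so inspect the PTX or SASS if it matters to you.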
