Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough?
I cannot provide a working minimal example, as the code is very long and confidential, and the error appears only in certain run/build configurations.
The code looks basically like the following:
if (threadIdx.x >= 30) {
    temp.x = __shfl_up_sync(0xC0000000, x, 1);
    temp.y = __shfl_up_sync(0xC0000000, y, 1);
}
// __syncwarp();
__shfl_up_sync(0xffffffff, w, 1, 32);
Release builds worked fine; with debug builds, lanes 30 and 31 waited (according to the debugger and SASS) at a different sync instruction than the other lanes.
When I introduced __syncwarp(), the debug builds also ran through. But I am not sure whether this actually fixes the problem or just hides it.
I am using a mask in the first two shuffle instructions indicating that only lanes 30 and 31 participate. What happens if the scheduler executes lanes 0 to 29 first, so that they reach the second shuffle instruction (in which all lanes participate)? That shuffle then waits for lanes 30 and 31, which first arrive at the upper shuffle instructions. Can the two shuffles be distinguished?
If the __syncwarp() is needed: why would it react differently than the shuffle instruction with mask 0xffffffff itself?
Because it is of a different type (a shuffle sync instead of a normal sync)? Or did the program just work by accident?
(The __syncwarp() intrinsic is probably useful here anyway (for performance reasons), as the threads converge at that point.)
If __syncwarp() is not enough: how can I make sure the kernel does not hang? Is there generally another recommended approach besides __syncwarp()?
I am running this on a Turing RTX 2060 Mobile (and debugging with Visual Studio).
No, you should not need a
__syncwarp() here. CUDA went from e.g. __shfl_up() to __shfl_up_sync() precisely to avoid this. I think the problem is that you are trying to shuffle up data from a thread that is not participating in the call, i.e. thread 30 is trying to get data from thread 29, so thread 29 has to participate (from the docs). Although this explanation is still unsatisfactory, as you seem to get a deadlock instead of an undefined value. But maybe this is the intended behavior for a debug build?
That being said, I'm not quite sure how to do this elegantly, because just including thread 29 in the conditional and mask will only shift the problem to 29 trying to get data from 28. In the examples given in the documentation, they always do the intrinsic with all threads and then conditionally use the results.
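A minimal sketch of that documented pattern, applied to the question's snippet (x and temp are the placeholders from the question; the full mask is the point of the pattern):

```cuda
// Every lane in the warp executes the shuffle with the full mask, so the
// participation mask is trivially correct; only the lanes that actually
// need the shifted value consume the result afterwards.
float shifted = __shfl_up_sync(0xffffffff, x, 1);  // all 32 lanes participate
if (threadIdx.x >= 30) {
    temp.x = shifted;  // only lanes 30 and 31 use the result
}
```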
My best guess is that you want thread 29 to participate, but with a delta of 0. I have not found anything saying that delta needs to be the same across threads.
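That guess might look like the following hypothetical sketch; the mask 0xE0000000 covers lanes 29 to 31, and whether a per-lane delta is well-defined is exactly the open question above:

```cuda
// Hypothetical: include lane 29 as a data source for lane 30, but give it
// delta 0 so its own value stays unchanged. 0xE0000000 = lanes 29, 30, 31.
if (threadIdx.x >= 29) {
    unsigned delta = (threadIdx.x >= 30) ? 1u : 0u;
    temp.x = __shfl_up_sync(0xE0000000, x, delta);
}
```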
You might also want to use __ballot_sync() to retrieve the mask, as can be seen in Listing 3 of this blogpost, to avoid bugs from manually specifying a mask, which needs to be changed whenever the conditional is changed.
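A sketch of that __ballot_sync() pattern applied to the question's code; note the ballot itself must be executed by all lanes named in its own mask, i.e. before the branch:

```cuda
// Derive the participation mask from the branch condition itself, so the
// mask can never drift out of sync with the conditional.
unsigned mask = __ballot_sync(0xffffffff, threadIdx.x >= 30);  // all lanes vote
if (threadIdx.x >= 30) {
    temp.x = __shfl_up_sync(mask, x, 1);  // mask is 0xC0000000 here
    temp.y = __shfl_up_sync(mask, y, 1);
}
```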