Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough?
I cannot provide a working minimal example, as the code is very long and confidential, and the error appears only in certain run/build configurations.
The code looks basically like the following:
if (threadIdx.x >= 30) {
    temp.x = __shfl_up_sync(0xC0000000, x, 1);
    temp.y = __shfl_up_sync(0xC0000000, y, 1);
}
// __syncwarp();
__shfl_up_sync(0xffffffff, w, 1, 32);
Release builds worked fine; with debug builds, lanes 30 and 31 waited (according to the debugger and SASS) at a different sync instruction than the other lanes.
When I introduced __syncwarp(), the debug builds also ran through. But I am not sure whether this actually fixes the problem or just hides it.
I am using a mask in the first two shuffle instructions indicating that only lanes 30 and 31 participate. What happens if the scheduler executes lanes 0 to 29 first, so that they reach the second shuffle instruction (in which all lanes participate)? That shuffle then waits for lanes 30 and 31, which first arrive at the upper shuffle instructions. Can the two shuffles be distinguished?
If the __syncwarp() is needed: why would it react differently than the shuffle instruction with mask 0xffffffff itself?
Because it is of a different type (a shuffle sync instead of a normal sync)? Or did the program just work by accident?
(The __syncwarp() intrinsic is probably useful here anyway (for performance reasons), as the threads converge at that point.)
If __syncwarp() is not enough: how can I make sure the kernel does not hang? Is there generally another recommended approach besides __syncwarp()?
I am running this on a Turing RTX 2060 Mobile (and debugging with Visual Studio).
No, you should not need a
__syncwarp() here. CUDA went from e.g. __shfl_up() to __shfl_up_sync() precisely to avoid this. I think the problem is that you are trying to shuffle up data from a thread that is not participating in the call, i.e. thread 30 is trying to get data from thread 29, so thread 29 has to participate (from the docs). Although this explanation is still unsatisfactory, as you seem to get a deadlock instead of an undefined value. But maybe this is the intended behavior for a debug build?
That being said, I'm not quite sure how to do this elegantly, because just including thread 29 in the conditional and mask will only shift the problem to 29 trying to get data from 28. In the examples given in the documentation, they always do the intrinsic with all threads and then conditionally use the results.
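A minimal sketch of that documented pattern, applied to the question's snippet (x and temp are the placeholders from the question; the full mask is the point of the pattern):

```cuda
// Every lane in the warp executes the shuffle with the full mask, so the
// participation mask is trivially correct; only the lanes that actually
// need the shifted value consume the result afterwards.
float shifted = __shfl_up_sync(0xffffffff, x, 1);  // all 32 lanes participate
if (threadIdx.x >= 30) {
    temp.x = shifted;  // only lanes 30 and 31 use the result
}
```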
My best guess is that you want thread 29 to participate, but with a delta of 0. I have not found anything saying that delta needs to be the same across threads.
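That guess might look like the following hypothetical sketch; the mask 0xE0000000 covers lanes 29 to 31, and whether a per-lane delta is well-defined is exactly the open question above:

```cuda
// Hypothetical: include lane 29 as a data source for lane 30, but give it
// delta 0 so its own value stays unchanged. 0xE0000000 = lanes 29, 30, 31.
if (threadIdx.x >= 29) {
    unsigned delta = (threadIdx.x >= 30) ? 1u : 0u;
    temp.x = __shfl_up_sync(0xE0000000, x, delta);
}
```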
You might also want to use __ballot_sync() to retrieve the mask, as can be seen in Listing 3 of this blogpost, to avoid bugs from manually specifying a mask, which needs to be changed whenever the conditional is changed.
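A sketch of that __ballot_sync() pattern applied to the question's code; note the ballot itself must be executed by all lanes named in its own mask, i.e. before the branch:

```cuda
// Derive the participation mask from the branch condition itself, so the
// mask can never drift out of sync with the conditional.
unsigned mask = __ballot_sync(0xffffffff, threadIdx.x >= 30);  // all lanes vote
if (threadIdx.x >= 30) {
    temp.x = __shfl_up_sync(mask, x, 1);  // mask is 0xC0000000 here
    temp.y = __shfl_up_sync(mask, y, 1);
}
```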