What is the purpose of the __threadfence intrinsic in CUDA?
I have gone through many forum posts and the NVIDIA documentation, but I couldn't understand what __threadfence()
does and how to use it. Could someone explain what the purpose of that intrinsic is?
Normally, there is no guarantee that if one block writes something to global memory, another block will "see" it. There is also no guarantee about the order in which writes to global memory become visible, except within the block that issued them.
There are two exceptions: atomic operations and __threadfence.
Imagine that one block produces some data and then uses an atomic operation to set a flag indicating that the data is there. Without further ordering, another block may see the flag and still read incorrect or incomplete data.
The __threadfence function comes to the rescue by enforcing the ordering: as seen from other blocks, all writes issued before it happen before all writes issued after it.
Note that __threadfence does not necessarily have to stall the current thread until its writes to global memory are visible to all other threads in the grid. Implemented in that naive way, it could hurt performance severely.
As an example, if you store your data, then call __threadfence(), and then atomically set a flag, it is guaranteed that if another block sees the flag, it will also see the data.
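The store / fence / atomic sequence can be sketched roughly as follows. This is a minimal illustration loosely modeled on the "last block finishes the sum" pattern, not a tuned reduction; the names sumBlocks, blocksDone, partial and runSumDemo are hypothetical, and the sketch assumes a CUDA-capable device is present and skips error checking for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical ticket counter; it plays the role of the "flag".
__device__ unsigned int blocksDone = 0;

// Launched with one thread per block to keep the sketch short.
// `partial` is volatile so the final reads are not served from a stale
// cached or register copy (an assumption borrowed from the usual pattern).
__global__ void sumBlocks(const int* in, int n, volatile int* partial, int* total)
{
    int chunk = (n + gridDim.x - 1) / gridDim.x;
    int begin = blockIdx.x * chunk;
    int end   = (begin + chunk < n) ? begin + chunk : n;

    int sum = 0;
    for (int i = begin; i < end; ++i) sum += in[i];

    partial[blockIdx.x] = sum;                          // 1. store the data
    __threadfence();                                    // 2. order the store before the atomic
    unsigned int ticket = atomicAdd(&blocksDone, 1u);   // 3. atomically signal completion

    // The block that draws the last ticket knows every other block has
    // already passed its fence, so all partial sums are safe to read.
    if (ticket == gridDim.x - 1) {
        int t = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) t += partial[b];
        *total = t;
        blocksDone = 0;                                 // reset for a later launch
    }
}

// Host-side driver: sums 1000 ones using 8 blocks and returns the total.
int runSumDemo()
{
    const int n = 1000, blocks = 8;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_partial, *d_total;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    sumBlocks<<<blocks, 1>>>(d_in, n, d_partial, d_total);

    int h_total = 0;
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return h_total;
}
```

Calling runSumDemo() from a main function in a file compiled with nvcc should return 1000 for this all-ones input. Without the __threadfence() call, the last block could in principle observe the final ticket while one of the partial sums is still not visible to it.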
Further reading: CUDA Programming Guide, Section B.5, "Memory Fence Functions" (as of version 11.5)