What does mem_fence() do in OpenCL, as opposed to barrier()?
Unlike barrier() (which I think I understand), mem_fence() does not affect all items in the work group. The OpenCL spec says (section 6.11.10), for mem_fence():

"Orders loads and stores of a work-item executing a kernel."

(so it applies to a single work item).

But, at the same time, in section 3.3.1, it says that:

"Within a work-item memory has load / store consistency."

so within a work item the memory is consistent.

So what kind of thing is mem_fence() useful for? It doesn't work across items, yet isn't needed within an item...

Note that I haven't used atomic operations (section 9.5 etc). Is the idea that mem_fence() is used in conjunction with those? If so, I'd love to see an example.

Thanks.

Update: I can see how it is useful when used with barrier() (implicitly, since the barrier calls mem_fence()) - but surely there must be more, since it exists separately?
Answers (3)
To try to put it more clearly (hopefully), this comes from: http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/asia/3_OpenCL_Programming.pdf
Memory operations can be reordered to suit the device they are running on. The spec states (basically) that any reordering of memory operations must ensure that memory is in a consistent state within a single work-item. However, what if you (for example) perform a store operation and the value decides to live in a work-item specific cache for now, until a better time presents itself to write through to local/global memory? If you try to load from that memory, the work-item that wrote the value has it in its cache, so no problem. But other work-items within the work-group don't, so they may read the wrong value. Placing a memory fence ensures that, at the time of the memory fence call, local/global memory (as per the parameters) will be made consistent (any caches will be flushed, and any reordering will take into account that you expect other threads may need to access this data after this point).
I admit it is still confusing, and I won't swear that my understanding is 100% correct, but I think it is at least the general idea.
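For a concrete picture of the kind of reordering the fence rules out, here is a minimal, hypothetical sketch (the kernel and buffer names are mine, not from the slides): the write_mem_fence() forces the store to result[] to be committed before the store to flag[], so anything that later observes flag[gid] == 1 should also observe the up-to-date result[gid].

__kernel void ordered_publish(__global int *result,
                              __global volatile int *flag)
{
    size_t gid = get_global_id(0);

    result[gid] = (int)gid * 2;              /* first store: the data            */
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);   /* commit it before the flag store  */
    flag[gid] = 1;                           /* second store: "data is ready"    */
}

Without the fence, the compiler or hardware would be free to buffer or reorder those two stores, which is exactly the situation described above.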
Follow Up:
I found this link which talks about CUDA memory fences, but the same general idea applies to OpenCL:
http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf
Check out section B.5 Memory Fence Functions.
They have a code example that computes the sum of an array of numbers in one call. The code is set up to compute a partial sum in each work-group. Then, if there is more summing to do, the code has the last work-group do the work.
So, basically two things are done by each work-group: a partial sum, which updates a global variable, then an atomic increment of a global counter variable.
After that, if there is any more work left to do, the work-group whose atomic increment returns ("number of work-groups" - 1) is taken to be the last work-group. That work-group goes on to finish up.
Now, the problem (as they explain it) is that, because of memory re-ordering and/or caching, the counter may get incremented and the last work-group may begin to do its work before that partial sum global variable has had its most recent value written to global memory.
A memory fence will ensure that the value of that partial sum variable is consistent for all threads before moving past the fence.
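Below is a rough OpenCL sketch of that pattern (the real example is the CUDA one in section B.5 of the linked guide; all names here are invented, the partial sum is simplified and assumes a power-of-two work-group size, and strictly speaking OpenCL 1.x does not guarantee global-memory visibility between work-groups, so treat this as an illustration of the idea rather than portable code):

__kernel void sum_with_last_group(__global const float *in,
                                  __global float *partial,        /* one slot per work-group  */
                                  __global float *total,          /* final result             */
                                  __global volatile uint *count,  /* initialised to 0 by host */
                                  __local float *scratch)
{
    uint lid    = get_local_id(0);
    uint group  = get_group_id(0);
    uint groups = get_num_groups(0);
    __local int is_last;

    /* Phase 1: partial sum within this work-group (simplified reduction). */
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        partial[group] = scratch[0];        /* publish this group's partial sum         */
        mem_fence(CLK_GLOBAL_MEM_FENCE);    /* make it visible before announcing it...  */
        uint seen = atomic_inc(count);      /* ...by bumping the counter                */
        is_last = (seen == groups - 1);     /* did this group finish last?              */
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Phase 2: only the last work-group adds up all the partial sums. */
    if (is_last && lid == 0) {
        float sum = 0.0f;
        for (uint g = 0; g < groups; ++g)
            sum += partial[g];
        *total = sum;
    }
}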
I hope this makes some sense. It is confusing.
This is how I understand it (I'm still trying to verify it):

mem_fence() will only make sure the memory is consistent and visible to all threads in the group, i.e. execution does NOT stop until there is another memory transaction (local or global). Which means that if there is a move instruction or an add instruction after a mem_fence(), the device will continue to execute these "non-memory transaction" instructions.

barrier(), on the other hand, will stop execution, period, and will only proceed after all threads reach that point AND all the memory transactions have been cleared. In other words, barrier() is a superset of mem_fence(). barrier() can prove more expensive in terms of performance than mem_fence().
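A small sketch of that difference (the names are mine, not the poster's): the kernel below needs barrier(), not just mem_fence(), because each work-item reads what its neighbour wrote to local memory, and only a barrier makes every item in the group wait until those stores have happened.

__kernel void neighbour_diff(__global const float *in,
                             __global float *out,
                             __local float *tile)
{
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    tile[lid] = in[gid];

    /* mem_fence(CLK_LOCAL_MEM_FENCE) here would only order THIS work-item's
       store against its own later loads; it would not make the work-item to
       the left wait, so tile[lid - 1] could still be stale.               */
    barrier(CLK_LOCAL_MEM_FENCE);   /* every item waits; all stores are visible */

    out[gid] = (lid > 0) ? (tile[lid] - tile[lid - 1]) : tile[lid];
}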
The fence ensures that loads and/or stores issued before the fence will complete before any loads and/or stores issued after the fence. No sync is implied by the fences alone. The barrier operation supports a read/write fence in one or both memory spaces, as well as blocking until all work items in a given work-group reach it.
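To illustrate that last point, a short sketch (the kernel name is made up): the flag argument selects which memory space(s) the fence applies to, and barrier() takes the same flags while additionally making the whole work-group wait.

__kernel void fence_flags_demo(__global int *g, __local int *l)
{
    uint lid = get_local_id(0);

    l[lid] = (int)lid;
    g[get_global_id(0)] = (int)lid;

    read_mem_fence(CLK_LOCAL_MEM_FENCE);                   /* order loads, local memory only   */
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);                 /* order stores, global memory only */
    mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); /* loads and stores, both spaces    */
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);   /* same fence, plus group-wide wait */
}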