What does mem_fence() do in OpenCL, as opposed to barrier()?
Unlike barrier() (which I think I understand), mem_fence() does not affect all items in the work group. The OpenCL spec says (section 6.11.10), for mem_fence():

"Orders loads and stores of a work-item executing a kernel."

(so it applies to a single work item).

But, at the same time, in section 3.3.1, it says that:

"Within a work-item memory has load / store consistency."

so within a work item the memory is consistent.

So what kind of thing is mem_fence() useful for? It doesn't work across items, yet isn't needed within an item...

Note that I haven't used atomic operations (section 9.5 etc). Is the idea that mem_fence() is used in conjunction with those? If so, I'd love to see an example.

Thanks.

Update: I can see how it is useful when used with barrier() (implicitly, since the barrier calls mem_fence()) - but surely there must be more, since it exists separately?
Answers (3)
To try to put it more clearly (hopefully), this comes from: http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/asia/3_OpenCL_Programming.pdf
Memory operations can be reordered to suit the device they are running on. The spec states (basically) that any reordering of memory operations must ensure that memory is in a consistent state within a single work-item. However, what if you (for example) perform a store operation and the value decides to live in a work-item specific cache for now, until a better time presents itself to write through to local/global memory? If you try to load from that memory, the work-item that wrote the value has it in its cache, so no problem. But other work-items within the work-group don't, so they may read the wrong value. Placing a memory fence ensures that, at the time of the memory fence call, local/global memory (as per the parameters) will be made consistent (any caches will be flushed, and any reordering will take into account that you expect other threads may need to access this data after this point).
I admit it is still confusing, and I won't swear that my understanding is 100% correct, but I think it is at least the general idea.
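For a concrete picture of the kind of reordering the fence rules out, here is a minimal, hypothetical sketch (the kernel and buffer names are mine, not from the slides): the write_mem_fence() forces the store to result[] to be committed before the store to flag[], so anything that later observes flag[gid] == 1 should also observe the up-to-date result[gid].

__kernel void ordered_publish(__global int *result,
                              __global volatile int *flag)
{
    size_t gid = get_global_id(0);

    result[gid] = (int)gid * 2;              /* first store: the data            */
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);   /* commit it before the flag store  */
    flag[gid] = 1;                           /* second store: "data is ready"    */
}

Without the fence, the compiler or hardware would be free to buffer or reorder those two stores, which is exactly the situation described above.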
Follow Up:
I found this link which talks about CUDA memory fences, but the same general idea applies to OpenCL:
http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf
Check out section B.5 Memory Fence Functions.
They have a code example that computes the sum of an array of numbers in one call. The code is set up to compute a partial sum in each work-group. Then, if there is more summing to do, the code has the last work-group do the work.
So, basically two things are done by each work-group: a partial sum, which updates a global variable, then an atomic increment of a global counter variable.
After that, if there is any more work left to do, the work-group whose atomic increment returns ("number of work-groups" - 1) is taken to be the last work-group. That work-group goes on to finish up.
Now, the problem (as they explain it) is that, because of memory re-ordering and/or caching, the counter may get incremented and the last work-group may begin to do its work before that partial sum global variable has had its most recent value written to global memory.
A memory fence will ensure that the value of that partial sum variable is consistent for all threads before moving past the fence.
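Below is a rough OpenCL sketch of that pattern (the real example is the CUDA one in section B.5 of the linked guide; all names here are invented, the partial sum is simplified and assumes a power-of-two work-group size, and strictly speaking OpenCL 1.x does not guarantee global-memory visibility between work-groups, so treat this as an illustration of the idea rather than portable code):

__kernel void sum_with_last_group(__global const float *in,
                                  __global float *partial,        /* one slot per work-group  */
                                  __global float *total,          /* final result             */
                                  __global volatile uint *count,  /* initialised to 0 by host */
                                  __local float *scratch)
{
    uint lid    = get_local_id(0);
    uint group  = get_group_id(0);
    uint groups = get_num_groups(0);
    __local int is_last;

    /* Phase 1: partial sum within this work-group (simplified reduction). */
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        partial[group] = scratch[0];        /* publish this group's partial sum         */
        mem_fence(CLK_GLOBAL_MEM_FENCE);    /* make it visible before announcing it...  */
        uint seen = atomic_inc(count);      /* ...by bumping the counter                */
        is_last = (seen == groups - 1);     /* did this group finish last?              */
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Phase 2: only the last work-group adds up all the partial sums. */
    if (is_last && lid == 0) {
        float sum = 0.0f;
        for (uint g = 0; g < groups; ++g)
            sum += partial[g];
        *total = sum;
    }
}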
I hope this makes some sense. It is confusing.
This is how I understand it (I'm still trying to verify it):

mem_fence() will only make sure the memory is consistent and visible to all threads in the group, i.e. execution does NOT stop until there is another memory transaction (local or global). Which means that if there is a move instruction or an add instruction after a mem_fence(), the device will continue to execute these "non-memory transaction" instructions.

barrier(), on the other hand, will stop execution, period, and will only proceed after all threads reach that point AND all the memory transactions have been cleared. In other words, barrier() is a superset of mem_fence(). barrier() can prove more expensive in terms of performance than mem_fence().
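A small sketch of that difference (the names are mine, not the poster's): the kernel below needs barrier(), not just mem_fence(), because each work-item reads what its neighbour wrote to local memory, and only a barrier makes every item in the group wait until those stores have happened.

__kernel void neighbour_diff(__global const float *in,
                             __global float *out,
                             __local float *tile)
{
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    tile[lid] = in[gid];

    /* mem_fence(CLK_LOCAL_MEM_FENCE) here would only order THIS work-item's
       store against its own later loads; it would not make the work-item to
       the left wait, so tile[lid - 1] could still be stale.               */
    barrier(CLK_LOCAL_MEM_FENCE);   /* every item waits; all stores are visible */

    out[gid] = (lid > 0) ? (tile[lid] - tile[lid - 1]) : tile[lid];
}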
The fence ensures that loads and/or stores issued before the fence will complete before any loads and/or stores issued after the fence. No sync is implied by the fences alone. The barrier operation supports a read/write fence in one or both memory spaces, as well as blocking until all work items in a given work-group reach it.
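To illustrate that last point, a short sketch (the kernel name is made up): the flag argument selects which memory space(s) the fence applies to, and barrier() takes the same flags while additionally making the whole work-group wait.

__kernel void fence_flags_demo(__global int *g, __local int *l)
{
    uint lid = get_local_id(0);

    l[lid] = (int)lid;
    g[get_global_id(0)] = (int)lid;

    read_mem_fence(CLK_LOCAL_MEM_FENCE);                   /* order loads, local memory only   */
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);                 /* order stores, global memory only */
    mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); /* loads and stores, both spaces    */
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);   /* same fence, plus group-wide wait */
}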