CC: 7.5, Windows: 10.0, CUDA: 11.7
I'm performing a bunch of atomic operations on device memory. Every thread in a warp is operating on a consecutive uint32_t. And every warp in the block updates those same values, before they all move on to the next line.
Since I'm not using any shared memory, I was hoping the L1 cache would be used to cache the device memory, effectively giving me an atomicAnd against shared memory without all the overhead and headaches of __syncthreads and copying the data around.
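To make the setup concrete, here's a minimal sketch of the pattern (names and sizes are made up; only the access pattern matters):

```
#include <cstdint>

// Sketch only: lane i of every warp ANDs into word i of the current 32-word
// line, so every warp in the block updates the same 32 consecutive uint32_t
// values before the block moves on to the next line.
__global__ void and_lines(uint32_t* dst, const uint32_t* src, int num_lines)
{
    const int lane = threadIdx.x & 31;    // position within the warp
    const int warp = threadIdx.x >> 5;    // warp index within the block
    const int warps_per_block = blockDim.x >> 5;

    for (int line = 0; line < num_lines; ++line) {
        // each warp contributes its own source word, but all warps target
        // the same destination word for a given lane
        uint32_t value = src[(line * warps_per_block + warp) * 32 + lane];
        atomicAnd(&dst[line * 32 + lane], value);
    }
}
```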
But the performance suggests that's not what's happening.
Indeed, looking at Nsight, it says there's a 0% hit rate in the L1 cache. Ouch. The memory workload analysis also shows a 0% hit rate under Global Atomic ALU.
Google turned up one hit (somewhat dated) suggesting that atomics are always done via L2 for device memory. Not exactly an authoritative source, but it matches what I'm seeing. On the other hand, there's this, which seems to suggest it does (did?) go thru L1. A more authoritative source, but not exactly on point.
Could I have something misconfigured? Maybe my code isn't doing what I think it is? Or do atomic operations against device memory always go thru L2?
- I tried using RED instead of atomics, but that didn't make any difference.
- I also tried using atomicAnd_block instead of just atomicAnd, and somehow that made things even slower? Not what I expected. (Both variants are sketched below.)
- I'd like to experiment with redux, but cc 8.0 isn't an option for me yet. __shfl_sync turned out to be disappointing (performance-wise).
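For reference, this is roughly what those two variants look like (my reconstruction, not the original code; there's no dedicated intrinsic for red, so one way to force it is inline PTX — the compiler may also emit it on its own when the atomic's return value is unused):

```
// Sketch of the variants mentioned above (reconstruction, not the original code).
__device__ void and_variants(unsigned int* addr, unsigned int value)
{
    // baseline: device-wide atomic
    atomicAnd(addr, value);

    // "RED instead of atomics": the reduction form returns no value;
    // written here as inline PTX
    asm volatile("red.global.and.b32 [%0], %1;" :: "l"(addr), "r"(value) : "memory");

    // block-scoped atomic (cc 6.0+): atomic only with respect to threads
    // in the same thread block
    atomicAnd_block(addr, value);
}
```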
At this point I'm inclined to believe that in 7.5, atomics on device memory always go thru L2. But if someone has evidence to the contrary, I can keep digging.
Comments (1)
As usual with Nvidia, concrete information is hard to come by. But we can have a look at the PTX documentation and infer a few things.
Atomic load and store
Atomic loads and stores use variations of their regular `ld` and `st` instructions, which have the following pattern:
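(What follows is my simplified paraphrase of the grammar, plus a couple of concrete instances written as CUDA inline PTX; the exact set of optional qualifiers depends on the PTX ISA version.)

```
// Simplified paraphrase of the PTX ld/st grammar (many qualifiers omitted):
//   weak form:    ld{.weak}{.ss}{.cop}.type  d, [a];    st{.weak}{.ss}{.cop}.type  [a], d;
//   atomic form:  ld.sem.scope{.ss}.type     d, [a];    st.sem.scope{.ss}.type     [a], d;
// Concrete instances of the weak form with explicit cache operators:
__device__ unsigned int cop_load_store(unsigned int* p, unsigned int v)
{
    unsigned int a, b;
    asm volatile("ld.global.ca.u32 %0, [%1];" : "=r"(a) : "l"(p));   // cache in L1 and L2
    asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(b) : "l"(p));   // cache in L2 only
    asm volatile("st.global.wb.u32 [%0], %1;" :: "l"(p), "r"(v));    // write-back store
    asm volatile("st.global.cg.u32 [%0], %1;" :: "l"(p), "r"(v));    // L2-only store
    return a & b;
}
```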
`weak` loads and stores are regular memory operations. The `cop` part specifies the cache behavior. For our purposes, there is `ld.cg` (cache-global), which only uses the L2 cache, and `ld.ca` (cache-all), which uses both the L1 and L2 caches. As the documentation notes, global data is coherent at the L2 level, but the L1 caches are not coherent for global data.
Similarly, there is `st.cg`, which caches only in L2. It "bypasses the L1 cache." The wording isn't precise, but it sounds as if this should invalidate the L1 cache. Otherwise, even within a single thread, a sequence of `ld.ca; st.cg; ld.ca` would read stale data, and that sounds like an insane idea.
The second relevant `cop` for a write is `st.wb` (write-back). The wording in the documentation is very weird. I guess this writes back to the L1 cache and may later be evicted to L2 and beyond.
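(Side note from me: CUDA also exposes these cache operators without inline PTX, through load/store intrinsics that, as far as I can tell, map directly onto the `cop` qualifiers above — treat the exact availability as an assumption and check the toolkit docs.)

```
// Assumed mapping of CUDA cache-hint intrinsics onto the cop qualifiers:
__device__ void cop_intrinsics(unsigned int* p, unsigned int v)
{
    unsigned int a = __ldca(p);   // ld.global.ca -- cache in L1 and L2
    unsigned int b = __ldcg(p);   // ld.global.cg -- cache in L2 only, bypass L1
    __stwb(p, a & b & v);         // st.global.wb -- write-back store
    __stcg(p, a & b & v);         // st.global.cg -- L2-only store
}
```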
The `ld.sem` and `st.sem` forms (where `sem` is one of relaxed, acquire, or release) are the true atomic loads and stores. `scope` gives the, well, scope of the synchronization, meaning for example whether an acquire is synchronized within a thread block or across the whole GPU.
Notice how these operations have no `cop` element, so you cannot even specify a cache layer. You can give cache hints, but I don't see how those are sufficient to specify the desired semantics: `cache_hint` and `cache-policy` only work on L2. Only `eviction_priority` mentions L1. But just because that performance hint is accepted does not mean it has any effect. I assume it works for weak memory operations, but for atomics only the L2 policies have any effect. This is just conjecture, though.
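(Again my own illustration: scoped atomic loads and stores written as inline PTX; note there is nowhere to put a `cop` qualifier.)

```
// Scoped atomic loads and stores -- sem and scope are selectable, cop is not.
__device__ unsigned int scoped_load_store(unsigned int* p, unsigned int v)
{
    unsigned int block_scoped, gpu_scoped;
    asm volatile("ld.acquire.cta.global.u32 %0, [%1];" : "=r"(block_scoped) : "l"(p) : "memory"); // thread-block scope
    asm volatile("ld.acquire.gpu.global.u32 %0, [%1];" : "=r"(gpu_scoped)   : "l"(p) : "memory"); // whole-GPU scope
    asm volatile("st.release.gpu.global.u32 [%0], %1;" :: "l"(p), "r"(v)    : "memory");          // release store, GPU scope
    return block_scoped & gpu_scoped;
}
```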
Atomic Read-modify-write
The `atom` instruction is used for atomic exchange, compare-and-swap, addition, and so on; `red` is used for reductions. They have the following structure, with these elements:
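(Again paraphrased and simplified by me; the elements are spelled out in the comments, followed by concrete instances as inline PTX.)

```
// Simplified paraphrase of the atom/red grammar:
//   atom{.sem}{.scope}{.space}.op.type  d, [a], b;   // returns the old value in d
//   red{.sem}{.scope}{.space}.op.type      [a], b;   // reduction form, returns nothing
// Elements: sem = memory ordering, scope = cta/gpu/sys, space = state space
// (e.g. global or shared), op = and/or/xor/add/min/max/exch/cas/..., type = b32/u32/...
// Note: there is still no cop element.
__device__ unsigned int rmw_examples(unsigned int* p, unsigned int v)
{
    unsigned int old;
    asm volatile("atom.global.and.b32 %0, [%1], %2;" : "=r"(old) : "l"(p), "r"(v) : "memory");
    asm volatile("red.global.and.b32 [%0], %1;"      :: "l"(p), "r"(v)            : "memory");
    return old;
}
```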
Given that there is no way to specify L1 caching or write-back behavior, there is no way to perform atomic RMW operations in the L1 cache. That makes a lot of sense to me: why should the GPU waste transistors on implementing this? Shared memory exists for exactly this purpose, allowing fast memory operations within a thread block.