Can CUDA atomic operations use the L1 cache?

Posted 2025-02-10 08:50:18 · 1,487 characters · 1 view · 0 comments


CC: 7.5, Windows: 10.0, CUDA: 11.7

I'm performing a bunch of atomic operations on device memory. Every thread in a warp is operating on a consecutive uint32_t. And every warp in the block updates those same values, before they all move on to the next line.
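The pattern, sketched with made-up names (`dst`, `src`, `nLines` are not my real code):

```cuda
#include <cstdint>

// Sketch of the access pattern described above; dst, src, nLines are
// hypothetical names. Each lane owns one consecutive uint32_t of a 32-word
// line, and every warp in the block ANDs its contribution into that same line.
__global__ void andLines(uint32_t* dst, const uint32_t* src, int nLines)
{
    const int lane   = threadIdx.x & 31;  // which word of the line this thread owns
    const int warp   = threadIdx.x >> 5;
    const int nWarps = blockDim.x  >> 5;

    for (int line = 0; line < nLines; ++line) {
        // All warps hit the same 32 consecutive words of dst.
        atomicAnd(&dst[line * 32 + lane],
                  src[(line * nWarps + warp) * 32 + lane]);
    }
}
```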

Since I'm not using any shared memory, I was hoping that it would be used to cache the device memory, effectively doing an atomicAnd against shared memory without all the overhead and headaches of syncthreads and copying the data around.

But the performance suggests that's not what's happening.

Indeed, looking at Nsight, it's saying there's a 0% hit rate in L1 cache. Ouch. The memory workload analysis also shows 0% Hit under Global Atomic ALU.

Google turned up one hit (somewhat dated) suggesting that atomics are always done via L2 for device memory. Not exactly an authoritative source, but it matches what I'm seeing. On the other hand, there's another source which seems to suggest it does (did?) go thru L1. A more authoritative source, but not exactly on point.

Could I have something misconfigured? Maybe my code isn't doing what I think it is? Or do atomic operations against device memory always go thru L2?

  • I tried using RED instead of atomics, but that didn't make any difference.
  • I also tried using atomicAnd_block instead of just atomicAnd, and somehow that made things even slower? Not what I expected.
  • I'd like to experiment with redux, but cc 8.0 isn't an option for me yet. __shfl_sync turned out to be disappointing (performance-wise).
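For reference, the `_block` variant from the second bullet is a drop-in replacement (it requires CC 6.0+; the names below are made up):

```cuda
#include <cstdint>

// atomicAnd_block only guarantees atomicity with respect to threads in the
// same thread block. In principle that gives the hardware more freedom;
// in practice it measured slower here. dst and val are hypothetical names.
__device__ void andWordBlockScoped(uint32_t* dst, uint32_t val)
{
    atomicAnd_block(dst, val);   // device-scope version: atomicAnd(dst, val)
}
```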

At this point I'm inclined to believe that in 7.5, atomics on device memory always go thru L2. But if someone has evidence to the contrary, I can keep digging.


Comments (1)

柠北森屋 2025-02-17 08:50:18


As usual with Nvidia, concrete information is hard to come by. But we can have a look at the PTX documentation and infer a few things.

Atomic load and store

Atomic loads and stores use variations of their regular ld and st instructions which have the following pattern:

ld{.weak}{.ss}{.cop}{.level::cache_hint}{.level::prefetch_size}{.vec}.type  d, [a]{, cache-policy};
ld.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.level::prefetch_size}{.vec}.type  d, [a]{, cache-policy};

st{.weak}{.ss}{.cop}{.level::cache_hint}{.vec}.type   [a], b{, cache-policy};
st.sem.scope{.ss}{.level::eviction_priority}{.level::cache_hint}{.vec}.type [a], b{, cache-policy};

weak loads and stores are regular memory operations. The cop part specifies the cache behavior. For our purposes, there is ld.cg (cache-global) that only uses the L2 cache and ld.ca (cache-all), which uses L1 and L2 cache. As the documentation notes:

Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread. The driver must invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first grid program are then correctly fetched by the second grid program issuing default ld.ca loads cached in L1.

Similarly, there is st.cg which caches only in L2. It "bypasses the L1 cache." The wording isn't precise but it sounds as if this should invalidate the L1 cache. Otherwise even within a single thread, a sequence of ld.ca; st.cg; ld.ca would read stale data and that sounds like an insane idea.

The second relevant cop for writes is st.wb (write-back). The wording in the documentation is very weird. I guess this writes back to L1 cache and may later evict to L2 and up.
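For what it's worth, these cache operators can be requested from CUDA C++ through the cache-hint load/store intrinsics (available on CC 3.2+); a minimal illustration, not tied to any particular kernel:

```cuda
#include <cstdint>

__global__ void cacheOps(uint32_t* p)
{
    uint32_t a = __ldca(p);      // ld.global.ca: may be cached in L1 and L2
    uint32_t b = __ldcg(p + 1);  // ld.global.cg: bypasses L1, caches in L2
    __stcg(p + 2, a & b);        // st.global.cg: cache in L2 only
    __stwb(p + 3, a | b);        // st.global.wb: write-back (the default store)
}
```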

The ld.sem and st.sem (where sem is one of relaxed, acquire, or release) are the true atomic loads and stores. Scope gives the, well, scope of the synchronization, meaning for example whether an acquire is synchronized within a thread block or on the whole GPU.

Notice how these operations have no cop element. So you cannot even specify a cache layer. You can give cache hints but I don't see how those are sufficient to specify the desired semantics. cache_hint and cache-policy only work on L2.

Only the eviction_priority mentions L1. But just because that performance hint is accepted does not mean it has any effect. I assume it works for weak memory operations but for atomics, only the L2 policies have any effect. But this is just conjecture.

Atomic Read-modify-write

The atom instruction is used for atomic exchange, compare-and-swap, addition, etc. red is used for reductions. They have the following structure:

atom{.sem}{.scope}{.space}.op{.level::cache_hint}.type d, [a], b{, cache-policy};
red{.sem}{.scope}{.space}.op{.level::cache_hint}.type     [a], b{, cache-policy};

With these elements:

  • sem: memory synchronization behavior, such as acquire, release, or relaxed
  • scope: memory synchronization scope, e.g. acquire-release within a CTA (thread block) or GPU
  • space: global or shared memory
  • cache policy, level and hint: cache eviction policy. But there are no options for L1, only L2
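To see that there is nowhere to ask for L1, you can write the red form by hand in inline PTX. A sketch with relaxed semantics and GPU scope (note that nvcc already emits red automatically when an atomic's return value is unused):

```cuda
#include <cstdint>

// Fire-and-forget reduction: no value comes back, and the instruction has
// no modifier that selects L1 caching -- only L2 cache-policy hints exist.
__device__ void redAnd(uint32_t* addr, uint32_t val)
{
    asm volatile("red.relaxed.gpu.global.and.b32 [%0], %1;"
                 :: "l"(addr), "r"(val)
                 : "memory");
}
```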

Given that there is no way to specify L1 caching or write-back behavior, there is no way of using atomic RMW operations on L1 cache. This makes a lot of sense to me. Why should the GPU waste transistors on implementing this? Shared memory exists for the exact purpose of allowing fast memory operations within a thread block.
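So the workaround the question was hoping to avoid remains the idiomatic one: stage the hot line in shared memory, do the atomics on-chip, and write the result out once per block. A rough sketch, reusing the 32-words-per-line layout from the question (all names are made up):

```cuda
#include <cstdint>

__global__ void andLinesShared(uint32_t* dst, const uint32_t* src, int nLines)
{
    __shared__ uint32_t line[32];
    const int lane   = threadIdx.x & 31;
    const int warp   = threadIdx.x >> 5;
    const int nWarps = blockDim.x  >> 5;

    for (int l = 0; l < nLines; ++l) {
        if (warp == 0) line[lane] = 0xFFFFFFFFu;  // identity element for AND
        __syncthreads();
        // Shared-memory atomics: the contention stays on-chip instead of in L2.
        atomicAnd(&line[lane], src[(l * nWarps + warp) * 32 + lane]);
        __syncthreads();
        if (warp == 0) dst[l * 32 + lane] = line[lane];
        // No trailing barrier needed: only warp 0 touches line between the
        // barrier above and the one at the top of the next iteration.
    }
}
```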
