Do atomic operations on global memory in CUDA execute in parallel across a warp?
I need to do an atomic FP add operation on global memory on a CC 2.0 device. If the global data referenced by a warp fits into an aligned 128-byte sector, will these operations be done in parallel, or will they be executed one at a time?
My guess would be that they are parallel, but I am not sure of this.
Regards
Gautham Ganapathy
Comments (3)
When programming you can think of atomic operations as conceptually parallel (while still satisfying the requirements of atomicity).
When optimizing, it helps to be aware of serialization that might be occurring. What actually happens depends on the hardware you are running on. Performance depends on the location and number of atomic memory units, as well as the pattern of memory accesses being performed in parallel.
For example, if the locations that are addressed in parallel map to completely different atomic units, they will occur in parallel. If many addresses in parallel map to the same atomic unit, they must be serialized.
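The two access patterns described above can be compared with a small experiment. This is only a sketch: the kernel names `addDistinct` and `addSame` and the helper `runPattern` are made up for illustration, error checking is omitted, and the timings it prints depend entirely on the device. It needs to be built for compute capability 2.0 or higher, since `atomicAdd` on `float` in global memory requires that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds to its own element: no two threads share an address.
__global__ void addDistinct(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[i], 1.0f);
}

// Every thread adds to element 0: all addresses collide.
__global__ void addSame(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[0], 1.0f);
}

// Hypothetical helper: runs one pattern over n threads, reports the kernel
// time in milliseconds through *ms, and returns out[0] after the kernel.
float runPattern(bool sameAddress, int n, float *ms)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_out = NULL;
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_out, 0, n * sizeof(float));   // all-zero bytes == 0.0f

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    if (sameAddress) addSame<<<blocks, threads>>>(d_out, n);
    else             addDistinct<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    float first = 0.0f;
    cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return first;
}

int main()
{
    const int n = 1 << 16;   // small enough that the float sums stay exact
    float msDistinct = 0.0f, msSame = 0.0f;
    float a = runPattern(false, n, &msDistinct);
    float b = runPattern(true, n, &msSame);
    printf("distinct addresses: out[0] = %.1f, %.3f ms\n", a, msDistinct);
    printf("same address:       out[0] = %.1f, %.3f ms\n", b, msSame);
    return 0;
}
```

Both kernels produce correct sums regardless of how the hardware schedules the atomics; only the elapsed time differs, and how much it differs is hardware-specific.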
Atomic operation performance has improved consistently from sm_11 (Compute capability 1.1, where it first appeared), to sm_2x (Fermi devices), to sm_3x (Kepler devices). Kepler improved worst-case atomic memory operation performance (where many atomic operations access the same memory address) by up to 10X, and best case performance (where many atomic operations access very different memory addresses) by up to 2X. Atomic performance on Kepler is high enough that you may consider using atomics where previously you might have employed explicit parallel reduction code. See this presentation for more details.
Note: this discussion applies to global memory atomics. Shared memory atomics are a different beast; they generally result in serialization and therefore do not have very high performance.
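To make the trade-off between plain atomics and explicit parallel reduction concrete, here is a minimal sketch of both ways to sum an array. The kernel names, the `THREADS` constant and the `runSum` helper are illustrative assumptions, not code from the presentation mentioned above. The reduction variant uses ordinary shared memory loads and stores (not shared memory atomics) and assumes the block size equals `THREADS` and is a power of two.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256   // block size; must be a power of two for the reduction

// Variant A: one global atomicAdd per input element.
__global__ void sumAtomicOnly(const float *in, float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, in[i]);
}

// Variant B: tree reduction in shared memory, then one global atomicAdd
// per block.
__global__ void sumReduceThenAtomic(const float *in, float *result, int n)
{
    __shared__ float partial[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(result, partial[0]);
}

// Hypothetical helper: sums n ones with the chosen variant.
float runSum(bool useReduction, int n)
{
    float *h_in = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = NULL, *d_result = NULL;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));

    int blocks = (n + THREADS - 1) / THREADS;
    if (useReduction) sumReduceThenAtomic<<<blocks, THREADS>>>(d_in, d_result, n);
    else              sumAtomicOnly<<<blocks, THREADS>>>(d_in, d_result, n);

    float total = 0.0f;
    cudaMemcpy(&total, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_result);
    free(h_in);
    return total;
}

int main()
{
    const int n = 1 << 16;
    printf("atomics only:        %.1f\n", runSum(false, n));
    printf("reduce, then atomic: %.1f\n", runSum(true, n));
    return 0;
}
```

Variant A issues one atomic per element, all to the same address; variant B issues one per block. Which is faster on a given device is something to measure rather than assume.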
Atomic operations are slower than normal operations, because they really can't happen in parallel.
What will probably happen is that each add will be done one at a time, but execution won't progress past the add until all the threads have completed it, so it will look parallel from the code's perspective.
I'm not sure if the access will be coalesced or not, but the speed penalty from the atomic operations will probably outweigh the memory access speed benefit.
To rephrase what has already been said: ATOMIC operations will be performed in sequence, but since all other operations will be halted at that moment, they will APPEAR to have been performed at the same time (in parallel). One important thing to note is that although atomic operations are sequential, their ORDER cannot be controlled.
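The uncontrolled order matters for the floating-point adds in the question, because floating-point addition is not associative. The sketch below (the names `mixedMagnitudeAdd` and `runMixedAdd` are made up for illustration) has one thread add 1.0e8f and 1023 threads add 1.0f to the same location. Assuming round-to-nearest, any 1.0f added after the large value is rounded away, so the result depends on when thread 0's add happens to land and may differ between runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Thread 0 contributes one large value; every other thread contributes 1.0f.
__global__ void mixedMagnitudeAdd(float *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i == 0) ? 1.0e8f : 1.0f;
    atomicAdd(result, v);
}

// Hypothetical helper: launches 1024 threads and returns the accumulated sum.
float runMixedAdd()
{
    float *d_result = NULL;
    cudaMalloc(&d_result, sizeof(float));
    cudaMemset(d_result, 0, sizeof(float));

    mixedMagnitudeAdd<<<4, 256>>>(d_result);   // 4 * 256 = 1024 threads

    float total = 0.0f;
    cudaMemcpy(&total, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return total;
}

int main()
{
    // The spacing between adjacent floats near 1e8 is 8, so the result lies
    // somewhere between 100000000 (large value added first) and 100001024
    // (large value added last), depending on the order the hardware chose.
    for (int run = 0; run < 5; ++run)
        printf("run %d: %.1f\n", run, runMixedAdd());
    return 0;
}
```

With integer atomics the final value would be the same for every ordering; with floats, code that needs reproducible results cannot rely on atomic accumulation alone.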