GPU hiding memory access time

Posted on 2024-11-05


I'm aware that GPUs generally have high memory access times. However, performance isn't greatly hampered as the access time is 'hidden' by executing other instructions whilst waiting for the memory access.

I was just wondering: if you have a wavefront with 64 work-items and 16 processor cores, each processor core will have 64/16 = 4 work-items. Also, all the cores must execute all the work-items in parallel.

So if the work-item requires a memory access, what happens? Surely, as all the instructions are the same, you would have 16 memory accesses to compute (or just 1?). Is it then the case that another one of the 4 work-items on each core is substituted in to begin execution? Does this mean all 16 processor cores are now executing the same new work-item?


Comments (2)

淡看悲欢离合 2024-11-12 04:04:58


Your question is rather AMD-centric, and that is an architecture I am less fluent in, but the NVIDIA architecture uses a memory controller design which can fuse DRAM access requests into a single transaction ("memory coalescing" in NVIDIA speak).

The basic idea is that the memory controller will fuse requests that lie within a smallish address range into a single load or store that services every thread in the SIMD group executing the load. The most recent hardware supports 32, 64, 128, and 256 byte transaction sizes, and the memory controller is also smart enough to add additional single-word-sized transactions onto a large transaction in cases where the memory region accessed doesn't align to a transaction-size boundary.
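To make the coalescing behaviour concrete, here is a minimal CUDA sketch (the kernel names and stride value are illustrative, not from the answer). In the first kernel, a 32-thread warp touches one contiguous 128-byte region, which the controller can serve with a single transaction; the second kernel scatters the warp's requests so each thread needs its own transaction:

#include <cuda_runtime.h>

// Each thread in a 32-thread warp reads one consecutive 4-byte word, so the
// warp touches a single contiguous 128-byte region: the memory controller
// can service the whole warp with one 128-byte transaction.
__global__ void coalescedRead(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Adjacent threads read words 32 floats (128 bytes) apart, so each thread's
// request lands in a different 128-byte region and the controller has to
// issue a separate transaction for every thread in the warp.
__global__ void stridedRead(const float* in, float* out, int n) {
    const int stride = 32;  // one warp's width, in float elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride] * 2.0f;
}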

忆依然 2024-11-12 04:04:58


Your question is rather hard to answer because you are mixing things together. There are theoretical (abstract) entities, such as work-items and wavefronts (as far as I'm aware, "wavefront" = "warp" in NVIDIA's terminology), and physical ones, such as processors and multiprocessors (NVIDIA).

The theoretical abstractions are invented to make your programs independent of the underlying hardware configuration, so that you don't have to compute the index of the processor that will do the job on a 16-processor GPU and then redo that computation for a 32-processor GPU. You just think in terms of wavefronts (warps), which have a constant size.
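As a minimal sketch of what that abstraction buys you (the kernel name is illustrative): a grid-stride loop is written purely in terms of abstract work-items, so the identical code runs on a 16-processor GPU or a 32-processor one, and the hardware decides how warps map onto processors:

#include <cuda_runtime.h>

// A grid-stride loop: the kernel is expressed in terms of abstract
// work-items (threads), never physical processors. The same code runs
// unchanged whatever the processor count; the hardware schedules the
// warps/wavefronts onto whatever cores actually exist.
__global__ void scale(float* data, int n, float k) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= k;
    }
}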

Let's get back to your question:

"I'm aware that GPUs generally have high memory access times. However, performance isn't greatly hampered as the access time is 'hidden' by executing other instructions whilst waiting for the memory access."

Example (it is not technically correct, but serves as an illustration):

Suppose we execute 100 arithmetic instructions and then encounter a memory request. At the physical level, the instruction execution done by the warp/wavefront takes several hardware cycles. Here's how the memory operation is issued:

Requested address   : a, b, c, d, -, -, -, -, -, -,  -,  -,  -,  -,  -,  -
Abstract WorkItems  : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
SIMD Hardware cores : 0, 1, 2, 3, -, -, -, -, -, -,  -,  -,  -,  -,  -,  -

An NVIDIA warp takes 4 cycles to execute; here is the second cycle:

Requested address   : a, b, c, d, e, f, g, h, -, -,  -,  -,  -,  -,  -,  -
Abstract WorkItems  : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
SIMD Hardware cores : *, *, *, *, 0, 1, 2, 3, -, -,  -,  -,  -,  -,  -,  -

Let's skip the third cycle and jump to the fourth:

Requested address   : a, b, c, d, e, f, g, h, i, j,  k,  l,  m,  n,  o,  p
Abstract WorkItems  : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
SIMD Hardware cores : *, *, *, *, *, *, *, *, *, *,  *,  *,  0,  1,  2,  3

During these 4 cycles, the memory requests are accumulated.

Depending on which addresses are requested and how smart the hardware is, these requests are served coalesced according to the hardware specs. Suppose a..p are ordered sequentially within the range 0xFFF0..0xFFFF; then all of the requests will be served in one coalesced memory operation. If the hardware encounters addresses that it doesn't like (according to the specs), it will break the memory access down into several memory operations.
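Here is a small CUDA sketch of that break-down case (the kernel name and the 128-byte segment size are assumptions based on typical NVIDIA hardware): the same warp-wide read is served by one transaction when aligned and by two when it straddles a segment boundary:

// With offset == 0, a warp's 32 consecutive 4-byte reads fall inside one
// 128-byte segment aligned to a transaction boundary: one transaction.
// With offset == 1, the same 32 words straddle two 128-byte segments, so
// the controller breaks the access down into two memory operations.
__global__ void offsetRead(const float* in, float* out, int n, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + offset < n) out[i] = in[i + offset];
}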

Since the current warp has requested a memory operation, it suspends, and the hardware switches the physical processors over to the next warp. The new warp starts by doing its 100 instructions, the same as was done by the previous warp/wavefront. After encountering and issuing its memory operation, the second warp/wavefront also suspends. At this point, depending on your work-group size and other parameters, the hardware may resume the previous warp or continue with the next ones.

The number of warps is constant during kernel execution and is computed on the host before execution starts. This means that if you don't have those 100 useful instructions prior to the memory request, you will end up with all of your warps in a suspended state, which leads to hardware stalls and performance loss.
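Since the warp count is fixed before launch, the practical lever is the launch configuration. Below is a hedged sketch using the CUDA runtime's occupancy helper (the wrapper function and kernel are illustrative) to pick a block size that keeps enough warps resident for the scheduler to switch to while others are suspended on memory:

#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

// Ask the runtime for a block size that maximises resident warps, so the
// scheduler always has another warp to run while one is waiting on memory.
void launchWithGoodOccupancy(float* data, int n, float k) {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n
    scaleKernel<<<gridSize, blockSize>>>(data, n, k);
}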
