Understanding memory transfer performance (CUDA)

[Plot: cycle time vs. data size for the memory-only and kernel versions of the benchmark]

// Timestamp immediately before launching the transfers
auto ts = std::chrono::system_clock::now();

// Launch the host-to-device and device-to-host copies on separate streams
cudaMemcpyAsync(in_dev, in_host, 1000 * size, cudaMemcpyHostToDevice, stream_in);
cudaMemcpyAsync(out_host, out_dev, 1000 * size, cudaMemcpyDeviceToHost, stream_out);

// Wait for both transfers to finish before taking the second timestamp
cudaStreamSynchronize(stream_in);
cudaStreamSynchronize(stream_out);

time_data.push_back(std::chrono::system_clock::now() - ts);

These are the results of a benchmark I made for my own educational purposes. It's pretty simple: every 'cycle' of the program launches parallel transfers of data and waits for those operations to complete before taking a timestamp.

The kernel version adds a simple kernel that operates on every byte of data (also on a different stream). The trend of kernel execution time makes sense to me - my device only has so many SMs/cores and it will start taking longer once I ask for more.

What I don't understand is why the memory-transfer-only tests start ramping up exponentially at nearly the same data-size point as the core limitations. The memory bandwidth for my device is advertised as 600 GB/s. Transferring 10 MB here takes on average ~1.5 milliseconds, which isn't what napkin math would suggest given that bandwidth (10 MB / 600 GB/s would be roughly 17 microseconds). My expectation was that the time would be nearly constant, around the memory transfer latency, but that doesn't seem to be the case.

To confirm it was not my bootleg timestamp method, I ran the memory-only version with NSight Compute and confirmed that going from N=1000 KB to N=10000 KB increased the average async transfer time from ~80 us to around 800 us.
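
For reference, a device-side alternative to host wall-clock timestamps is CUDA events. The following is only a sketch (not part of the benchmark above); it reuses the in_dev/in_host/size/stream_in names from the snippet purely for illustration:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream_in);        // mark the start on the transfer stream
cudaMemcpyAsync(in_dev, in_host, 1000 * size, cudaMemcpyHostToDevice, stream_in);
cudaEventRecord(stop, stream_in);         // mark the end after the copy is enqueued
cudaEventSynchronize(stop);               // block until the copy and stop event complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time between the events, in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);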

What am I missing about D/H memory transfer performance? Is the key to getting good bandwidth overlapping lots of small transfers rather than large transfers, or would that be worse because of the limited number of copy engines?

I ran this benchmark on an RTX 3070 Ti in a PCIe 4 system.

Comments (1)

兲鉂ぱ嘚淚 2025-01-20 06:06:02

Many CUDA operations can be crudely modeled as an "overhead" and a "duration". The duration is often predictable from the operation characteristics - e.g. the size of the transfer divided by the bandwidth. The "overhead" can be crudely modeled as a fixed quantity - e.g. 5 microseconds.
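
Put as a rough formula (a restatement of the model above, not an exact fit), the time for a single async operation on N bytes is approximately:

    t(N) ≈ overhead + N / bandwidth

where the overhead is a fixed few-to-tens of microseconds and the bandwidth is that of whatever link the operation actually uses.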

Your graph consists of several measurements:

  1. The "overhead" associated with initiating a transfer or "cycle". CUDA async ops generally have a minimum duration on the order of 5-50 microseconds. This is indicated in the "flat" left hand side of the blue curve. A "cycle" here represents two transfers, plus, in the case of the "kernel" version, the kernel launch overhead. The combination of these "overhead" numbers represents the y-intercept of the blue and orange curves. The distance from the blue curve to the orange curve represents the addition of the kernel ops (which you haven't shown). On the left hand side of the curve, the operation sizes are so small that the contribution from the "duration" portion is small compared to the "overhead" contribution. This explains the approximate flatness of the curves on the left hand side.

  2. The "duration" of operations. On the right hand side of the curves, the approximately linear region corresponds to the "duration" contribution as it becomes large and dwarfs the "overhead" cost. The slope of the blue curve should correspond to the PCIE transfer bandwidth. For a Gen4 system that should be approximately 20-24GB/s per direction (it has no connection to the 600GB/s of GPU memory bandwidth - it is limited by the PCIE bus). The slope of the orange curve is also related to PCIE bandwidth, as this is the dominant contributor to the overall operation.

  3. The "kernel" contribution. The distance between the blue and orange curves represents the contribution of the kernel ops, over and above just the PCIE data transfers.

> What I don't understand is why the memory-transfer-only tests start ramping up exponentially at nearly the same data-size point as the core limitations. The memory bandwidth for my device is advertised as 600 GB/s. Transferring 10 MB here takes on average ~1.5 milliseconds, which isn't what napkin math would suggest given bandwidth.

The dominant transfer here is governed by the PCIE bus. That bandwidth is not 600GB/s but something like 20-24GB/s per direction. Furthermore, unless you are using pinned memory as the host memory for your transfers, the actual bandwidth will be about half of the maximum achievable. This lines up pretty well with your measurement: 10MB/1.5ms = 6.6GB/s. Why does this make sense? You are transferring 10MB at a rate of ~10GB/s on the first transfer. Unless you are using pinned memory, the operation will block and will not execute concurrently with the second transfer. Then you transfer 10MB at a rate of ~10GB/s on the second transfer. This is 20MB at 10GB/s, so we would expect to witness about a 2ms transfer time. Your actual transfer speeds might be closer to 12GB/s, which would put the expectation very close to 1.5ms.
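
To illustrate the pinned-memory point concretely, here is a sketch (not the original code; it assumes float buffers plus the device pointers, streams, and size from the question's snippet) of allocating the host buffers as pinned memory so the two async copies can actually overlap:

// Pinned (page-locked) host allocations instead of pageable malloc()/new[]
size_t bytes = 1000 * size;
float *in_host = nullptr, *out_host = nullptr;
cudaMallocHost((void**)&in_host, bytes);
cudaMallocHost((void**)&out_host, bytes);

// With pinned host buffers, cudaMemcpyAsync is truly asynchronous and the
// H2D and D2H copies can run concurrently on the GPU's two copy engines
cudaMemcpyAsync(in_dev, in_host, bytes, cudaMemcpyHostToDevice, stream_in);
cudaMemcpyAsync(out_host, out_dev, bytes, cudaMemcpyDeviceToHost, stream_out);
cudaStreamSynchronize(stream_in);
cudaStreamSynchronize(stream_out);

cudaFreeHost(in_host);
cudaFreeHost(out_host);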

> My expectation was that the time would be nearly constant, around the memory transfer latency, but that doesn't seem to be the case.

I'm not sure what that statement means, exactly, but for reasonably large transfer sizes, the time is not expected to be constant independent of the transfer size. The time should scale with the transfer size, with the bandwidth as the proportionality factor.

> I ran the memory-only version with NSight Compute and confirmed that going from N=1000 KB to N=10000 KB increased the average async transfer time from ~80 us to around 800 us.

That is the expectation. Transferring more data takes more time. This is generally what you would observe if the "duration" contribution is significantly larger than the "overhead" contribution, which is true on the right hand side of your graph.

Here is a spreadsheet showing a specific example, using 12GB/s for PCIE bandwidth and 5 microseconds for the fixed operation overhead. The "total for 2 ops" column tracks your blue curve pretty closely:

[Spreadsheet image]
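
If you want the same numbers without a spreadsheet, here is a small sketch that evaluates the overhead-plus-duration model (same assumed 12GB/s bandwidth and 5 microsecond overhead; the size list is just an example sweep, not the exact one benchmarked):

#include <cstdio>

int main() {
    const double bandwidth = 12e9;  // assumed effective PCIE bandwidth, bytes/s
    const double overhead  = 5e-6;  // assumed fixed per-operation overhead, seconds

    // Example transfer sizes in KB
    const double sizes_kb[] = {1.0, 10.0, 100.0, 1000.0, 10000.0};

    for (double kb : sizes_kb) {
        double bytes   = kb * 1000.0;
        double one_op  = overhead + bytes / bandwidth;  // overhead + duration model
        double two_ops = 2.0 * one_op;                  // H2D + D2H, serialized with pageable memory
        std::printf("%8.0f KB: one op %9.2f us, total for 2 ops %9.2f us\n",
                    kb, one_op * 1e6, two_ops * 1e6);
    }
    return 0;
}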
