How to calculate the bandwidth achieved by a CUDA kernel

Published 2024-12-11 17:40:22


I want to measure how much of the peak memory bandwidth my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:

    ...
    float result;
    for (int k = 0; k < 4000; k++) {
        result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
        ....
    }
    out_data[index] = result;
    out_data2[index] = sqrt(result);
    ...

I count 4000*2+2 accesses to global memory per thread. With 1,000,000 threads and every access being a 4-byte float, that gives ~32 GB of global memory traffic (reads and writes combined). As my kernel takes only 0.1 s, I would achieve ~320 GB/s, which is higher than the maximum bandwidth, so there must be an error in my calculations / assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions:

  • What is my error?
  • What accesses to global memory are cached and which are not?
  • Is it correct that I don't count accesses to registers, local, shared, and constant memory?
  • Can I use the CUDA profiler for easier and more accurate results? Which counters would I need to use? How would I need to interpret them?
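The back-of-the-envelope arithmetic above can be reproduced in a few lines of Python (a sanity check only; the thread count and the 0.1 s runtime are the figures stated in the question):

```python
# Figures taken from the question above.
accesses_per_thread = 4000 * 2 + 2      # 2 reads per loop iteration + 2 writes
threads = 1_000_000
bytes_per_access = 4                    # each access is a float

total_bytes = accesses_per_thread * bytes_per_access * threads
print(total_bytes / 1e9, "GB of global memory traffic")    # ~32 GB

kernel_time_s = 0.1
implied_bandwidth = total_bytes / kernel_time_s / 1e9      # GB/s
print(implied_bandwidth, "GB/s")        # ~320 GB/s, above the 102.4 GB/s peak
```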

Profiler output:

method              gputime    cputime  occupancy instruction warp_serial memtransfer
memcpyHtoD           10.944         17                                          16384
fill                  64.32         93          1       14556           0
fill                 64.224         83          1       14556           0
memcpyHtoD           10.656         11                                          16384
fill                 64.064         82          1       14556           0
memcpyHtoD          1172.96       1309                                        4194304
memcpyHtoD           10.688         12                                          16384
cu_more_regT      93223.906      93241          1    40716656           0
memcpyDtoH         1276.672       1974                                        4194304
memcpyDtoH         1291.072       2019                                        4194304
memcpyDtoH          1278.72       2003                                        4194304
memcpyDtoH             1840       3172                                        4194304

New question:
- When 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB, and gpu_time ≈ 0.1 s, then I achieve a bandwidth of 10 * 40 MB/s = 400 MB/s. That seems very low. Where is the error?
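Making that estimate's arithmetic explicit (a sketch using only the numbers quoted above; the factor of ten transfers and the 0.1 s figure are the question's own assumptions):

```python
# Per-transfer size, as reported in the memtransfer column of the profiler output.
bytes_per_transfer = 4 * 1024 * 1024    # 4 bytes * 1024*1024 data points = 4 MB
time_s = 0.1                            # gpu_time ~= 0.1 s, as assumed above

per_transfer = bytes_per_transfer / time_s / 2**20   # MB/s for one 4 MB transfer
total = 10 * per_transfer                            # ten such transfers
print(per_transfer, total)              # 40.0 MB/s per transfer, 400.0 MB/s total
```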

P.S. Tell me if you need other counters for your answer.

sister question: How to calculate Gflops of a kernel

Comments (1)

花间憩 2024-12-18 17:40:22

  • You do not really have 1,000,000 threads running at once. You perform ~32 GB of global memory accesses, and the achieved bandwidth is determined by the threads currently running (reading) on the SMs and by the size of the data read.
  • All global memory accesses are cached in L1 and L2 (on devices of compute capability ≥ 2.0) unless you tell the compiler to use uncached loads.
  • I think so. The achieved bandwidth is related to global memory.
  • I recommend using the Visual Profiler to see the read/write/global memory bandwidth. It would be interesting if you post your results :).
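To put the first point in numbers: the C1060 has 30 SMs, and a compute-capability 1.3 SM can hold at most 1024 resident threads, so even at full occupancy far fewer than 1,000,000 threads run concurrently. A rough sketch (assuming full occupancy):

```python
# Tesla C1060 (GT200, compute capability 1.3) hardware limits.
sms = 30
max_resident_threads_per_sm = 1024      # CC 1.3 per-SM limit

concurrent_threads = sms * max_resident_threads_per_sm
launched_threads = 1_000_000
waves = launched_threads / concurrent_threads   # how many "waves" of threads run
print(concurrent_threads, round(waves, 1))      # 30720 resident threads, ~32.6 waves
```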

The default counters in the Visual Profiler give you enough information to get an idea about your kernel (memory bandwidth, shared memory bank conflicts, instructions executed, ...).

Regarding your question, to calculate the achieved global memory throughput:

Compute Visual Profiler User Guide, DU-05162-001_v02, October 2010, page 56, Table 7 ("Supported Derived Statistics"):

Global memory read throughput in gigabytes per second. For compute capability < 2.0 this is calculated as (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime. For compute capability >= 2.0 this is calculated as ((DRAM reads) * 32) / gputime.
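The two derived-statistic formulas can be written out as plain functions. The unit conversion at the end is my assumption (the profiler reports gputime in microseconds, and the weighted transaction counts give bytes, so bytes/µs divided by 1000 yields GB/s):

```python
def read_throughput_cc1x(gld_32, gld_64, gld_128, tpc, gputime_us):
    """Global memory read throughput (GB/s) for compute capability < 2.0."""
    # Transactions weighted by their size in bytes, scaled by the TPC count.
    bytes_read = (gld_32 * 32 + gld_64 * 64 + gld_128 * 128) * tpc
    return bytes_read / gputime_us / 1e3   # bytes/us -> GB/s

def read_throughput_cc2x(dram_reads, gputime_us):
    """Global memory read throughput (GB/s) for compute capability >= 2.0."""
    return dram_reads * 32 / gputime_us / 1e3   # each DRAM read is 32 bytes
```

For example, one billion DRAM reads in one second (1,000,000 µs) come out to `read_throughput_cc2x(1_000_000_000, 1_000_000)` = 32.0 GB/s.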

Hope this helps.
