How to calculate the bandwidth achieved by a CUDA kernel

Published 2024-12-11 17:40:22


I want to measure how much of the peak memory bandwidth my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:

    ...
    float result;
    for (int k = 0; k < 4000; k++) {
        result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
        ....
    }
    out_data[index] = result;
    out_data2[index] = sqrt(result);
    ...

I count 4000*2+2 accesses to global memory per thread. With 1,000,000 threads and every access being a 4-byte float, that gives ~32 GB of global memory traffic (reads and writes combined). As my kernel takes only 0.1 s, I would achieve ~320 GB/s, which is higher than the maximum bandwidth, so there must be an error in my calculations / assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions:

  • What is my error?
  • What accesses to global memory are cached and which are not?
  • Is it correct that I don't count accesses to registers, local, shared, and constant memory?
  • Can I use the CUDA profiler for easier and more accurate results? Which counters would I need to use? How would I need to interpret them?
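The back-of-the-envelope arithmetic above can be reproduced in a few lines of Python (a sanity check only; the thread count and the 0.1 s runtime are the figures stated in the question):

```python
# Figures taken from the question above.
accesses_per_thread = 4000 * 2 + 2      # 2 reads per loop iteration + 2 writes
threads = 1_000_000
bytes_per_access = 4                    # each access is a float

total_bytes = accesses_per_thread * bytes_per_access * threads
print(total_bytes / 1e9, "GB of global memory traffic")    # ~32 GB

kernel_time_s = 0.1
implied_bandwidth = total_bytes / kernel_time_s / 1e9      # GB/s
print(implied_bandwidth, "GB/s")        # ~320 GB/s, above the 102.4 GB/s peak
```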

Profiler output:

method              gputime    cputime  occupancy instruction warp_serial memtransfer
memcpyHtoD           10.944         17                                          16384
fill                  64.32         93          1       14556           0
fill                 64.224         83          1       14556           0
memcpyHtoD           10.656         11                                          16384
fill                 64.064         82          1       14556           0
memcpyHtoD          1172.96       1309                                        4194304
memcpyHtoD           10.688         12                                          16384
cu_more_regT      93223.906      93241          1    40716656           0
memcpyDtoH         1276.672       1974                                        4194304
memcpyDtoH         1291.072       2019                                        4194304
memcpyDtoH          1278.72       2003                                        4194304
memcpyDtoH             1840       3172                                        4194304

New question:
- When 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB, and gpu_time ≈ 0.1 s, then I achieve a bandwidth of 10 * 40 MB/s = 400 MB/s. That seems very low. Where is the error?
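Making that estimate's arithmetic explicit (a sketch using only the numbers quoted above; the factor of ten transfers and the 0.1 s figure are the question's own assumptions):

```python
# Per-transfer size, as reported in the memtransfer column of the profiler output.
bytes_per_transfer = 4 * 1024 * 1024    # 4 bytes * 1024*1024 data points = 4 MB
time_s = 0.1                            # gpu_time ~= 0.1 s, as assumed above

per_transfer = bytes_per_transfer / time_s / 2**20   # MB/s for one 4 MB transfer
total = 10 * per_transfer                            # ten such transfers
print(per_transfer, total)              # 40.0 MB/s per transfer, 400.0 MB/s total
```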

P.S. Tell me if you need other counters for your answer.

sister question: How to calculate Gflops of a kernel

Comments (1)

花间憩 2024-12-18 17:40:22

  • You do not really have 1,000,000 threads running at once. You perform ~32 GB of global memory accesses, and the achieved bandwidth is determined by the threads currently running (reading) on the SMs and by the size of the data read.
  • All global memory accesses are cached in L1 and L2 (on devices of compute capability ≥ 2.0) unless you tell the compiler to use uncached loads.
  • I think so. The achieved bandwidth is related to global memory.
  • I recommend using the Visual Profiler to see the read/write/global memory bandwidth. It would be interesting if you post your results :).
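To put the first point in numbers: the C1060 has 30 SMs, and a compute-capability 1.3 SM can hold at most 1024 resident threads, so even at full occupancy far fewer than 1,000,000 threads run concurrently. A rough sketch (assuming full occupancy):

```python
# Tesla C1060 (GT200, compute capability 1.3) hardware limits.
sms = 30
max_resident_threads_per_sm = 1024      # CC 1.3 per-SM limit

concurrent_threads = sms * max_resident_threads_per_sm
launched_threads = 1_000_000
waves = launched_threads / concurrent_threads   # how many "waves" of threads run
print(concurrent_threads, round(waves, 1))      # 30720 resident threads, ~32.6 waves
```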

The default counters in the Visual Profiler give you enough information to get an idea about your kernel (memory bandwidth, shared memory bank conflicts, instructions executed, ...).

Regarding your question, to calculate the achieved global memory throughput:

Compute Visual Profiler User Guide, DU-05162-001_v02, October 2010, page 56, Table 7 ("Supported Derived Statistics"):

Global memory read throughput in gigabytes per second. For compute capability < 2.0 this is calculated as (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime. For compute capability >= 2.0 this is calculated as ((DRAM reads) * 32) / gputime.
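The two derived-statistic formulas can be written out as plain functions. The unit conversion at the end is my assumption (the profiler reports gputime in microseconds, and the weighted transaction counts give bytes, so bytes/µs divided by 1000 yields GB/s):

```python
def read_throughput_cc1x(gld_32, gld_64, gld_128, tpc, gputime_us):
    """Global memory read throughput (GB/s) for compute capability < 2.0."""
    # Transactions weighted by their size in bytes, scaled by the TPC count.
    bytes_read = (gld_32 * 32 + gld_64 * 64 + gld_128 * 128) * tpc
    return bytes_read / gputime_us / 1e3   # bytes/us -> GB/s

def read_throughput_cc2x(dram_reads, gputime_us):
    """Global memory read throughput (GB/s) for compute capability >= 2.0."""
    return dram_reads * 32 / gputime_us / 1e3   # each DRAM read is 32 bytes
```

For example, one billion DRAM reads in one second (1,000,000 µs) come out to `read_throughput_cc2x(1_000_000_000, 1_000_000)` = 32.0 GB/s.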

Hope this helps.
