How to calculate Gflops of a kernel


I want a measure of how much of the peak performance my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a peak of 622.08 GFLOPS (~= 240 Cores * 1300 MHz * 2).
Now in my kernel I counted 16000 flop for each thread (4000 x (2 subtractions, 1 multiplication and 1 sqrt)). So when I have 1,000,000 threads I would come up with 16 GFLOP. And as the kernel takes 0.1 seconds I would achieve 160 GFLOPS, which would be a quarter of the peak performance (a minimal timing sketch of this approach follows the questions below). Now my questions:

  • Is this approach correct?
  • What about comparisons (if(a>b) then....)? Do I have to consider them as well?
  • Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out what the figure means.
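
For reference, here is a minimal timing sketch of the approach described above. It is a sketch under assumptions: mykernel is a placeholder for the real kernel, and the 16000 FLOP per thread is the hand-counted figure from the question.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real kernel: 4000 iterations of
// 2 subtractions, 1 multiplication and 1 sqrt per thread.
__global__ void mykernel(float *data)
{
    (void)data; // actual computation elided in this sketch
}

int main()
{
    const long long nThreads      = 1000000LL; // threads launched, as above
    const long long flopPerThread = 16000LL;   // hand count: 4000 x (2 sub + 1 mul + 1 sqrt)
    const int blockSize = 256;
    const int gridSize  = (int)((nThreads + blockSize - 1) / blockSize);

    float *d_data;
    cudaMalloc(&d_data, nThreads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mykernel<<<gridSize, blockSize>>>(d_data);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds

    // 16e9 FLOP in 0.1 s would print 160.0, a quarter of the quoted peak.
    const double gflops = (double)(nThreads * flopPerThread) / (ms * 1e-3) / 1e9;
    printf("time = %.3f ms, hand-counted rate = %.1f GFLOP/s\n", ms, gflops);

    cudaFree(d_data);
    return 0;
}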

Sister question: How to calculate the achieved bandwidth of a CUDA kernel

Comments (2)

彻夜缠绵 2024-12-18 14:11:39


First some general remarks:

In general, what you are doing is mostly an exercise in futility and is the reverse of how most people would probably go about performance analysis.

The first point to make is that the peak value you are quoting is strictly for floating point multiply-add instructions (FMAD), which count as two FLOPS and can be retired at a maximum rate of one per cycle. Other floating point operations that retire at a maximum rate of one per cycle would formally be classified as only a single FLOP, while others might require many cycles to be retired. So if you decide to quote kernel performance against that peak, you are really comparing your code's performance against a stream of pure FMAD instructions, and nothing more than that.
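
Worked out explicitly, that peak is 240 cores × 1.296 GHz (the C1060's shader clock, which the question rounds to 1300 MHz) × 2 FLOPS per FMAD = 622.08 GFLOP/s. Any other instruction mix is being measured against a ceiling that, by definition, it cannot reach.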

The second point is that when researchers quote FLOP/s values for a piece of code, they are usually using a model FLOP count for the operation, not trying to count instructions. Matrix multiplication and the Linpack LU factorization benchmarks are classic examples of this approach to performance benchmarking. The lower bound of the operation count of those calculations is exactly known, so the calculated throughput is simply that lower bound divided by the time. The actual instruction count is irrelevant. Programmers often use all sorts of techniques, including redundant calculations, speculative or predictive calculations, and a host of other ideas, to make code run faster. The actual FLOP count of such code is irrelevant; the reference is always the model FLOP count.
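
As an illustration, a model-FLOP throughput calculation for a matrix multiply could look like the sketch below. The sizes and the measured time are placeholder assumptions, not measurements from this thread; the point is that the numerator comes from the problem definition, not from the instruction stream.

#include <cstdio>

int main()
{
    // Model FLOP count for C = A*B with A (M x K) and B (K x N):
    // each of the M*N output elements takes K multiplies and K adds,
    // so the model count is 2*M*N*K, regardless of how the kernel
    // actually computed it.
    const long long M = 1024, K = 1024, N = 1024; // placeholder problem size
    const double seconds = 0.05;                  // placeholder measured kernel time
    const double modelFlops = 2.0 * M * N * K;
    printf("model throughput = %.2f GFLOP/s\n", modelFlops / seconds / 1e9);
    return 0;
}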

Finally, when looking at quantifying performance, there are usually only two points of comparison of any real interest:

  • Does version A of the code run faster than version B on the same hardware?
  • Does hardware A perform better than hardware B doing the task of interest?

In the first case you really only need to measure execution time. In the second, a suitable measure usually isn't FLOP/s but useful operations per unit time (records per second in sorting, cells per second in a fluid mechanical simulation, etc.). Sometimes, as mentioned above, the useful operations can be the model FLOP count of an operation of known theoretical complexity. But the actual floating point instruction count rarely, if ever, enters into the analysis.

If your interest is really about optimization and understanding the performance of your code, then this presentation by Paulius Micikevicius from NVIDIA might be of interest.

Addressing the bullet point questions:

Is this approach correct?

Strictly speaking, no. If you are counting floating point operations, you would need to know the exact FLOP count from the code the GPU is running. The sqrt operation can consume a lot more than a single FLOP, for example, depending on its implementation and the characteristics of the number it is operating on. The compiler can also perform a lot of optimizations which might change the actual operation/instruction count. The only way to get a truly accurate count would be to disassemble the compiled code (for example with cuobjdump) and count the individual floating point operations, perhaps even requiring assumptions about the characteristics of the values the code will compute.

What about comparisons (if(a>b) then....)? Do I have to consider them as well?

They are not floating point multiply-add operations, so no.

Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out what the figure means.

Not really. The profiler can't differentiate between a floating point instruction and any other type of instruction, so (as of 2011) obtaining a FLOP count for a piece of code via the profiler is not possible.

[EDIT: see Greg's excellent answer below for a discussion of the FLOP counting facilities available in versions of the profiling tools released since this answer was written]

鹤仙姿 2024-12-18 14:11:39


Nsight VSE (>3.2) and the Visual Profiler (>=5.5) support Achieved FLOPS calculation. In order to collect the metric, the profilers run the kernel twice (using kernel replay). In the first replay, the number of floating point instructions executed is collected (with an understanding of predication and the active mask); in the second replay, the duration is collected.

nvprof and the Visual Profiler have a hardcoded definition: FMA counts as 2 operations, and all other operations count as 1. The flops_sp_* counters are thread instruction execution counts, whereas flops_sp is the weighted sum, so a custom weighting can be applied using the individual metrics. However, flops_sp_special covers a number of different instructions.

The Nsight VSE experiment configuration allows the user to define the operations per instruction type.

Nsight Visual Studio Edition

Configuring to collect Achieved FLOPS

  1. Execute the menu command Nsight > Start Performance Analysis... to open the Activity Editor
  2. Set Activity Type to Profile CUDA Application
  3. In Experiment Settings set Experiments to Run to Custom
  4. In the Experiment List add Achieved FLOPS
  5. In the middle pane select Achieved FLOPS
  6. In the right pane you can customize the FLOPS per instruction executed. The default weighting counts FMA and RSQ as 2. In some cases I have seen RSQ as high as 5.
  7. Run the Analysis Session.

Nsight VSE Achieved FLOPS Experiment Configuration

Viewing Achieved FLOPS

  1. In the nvreport open the CUDA Launches report page.
  2. In the CUDA Launches page select a kernel.
  3. In the report correlation pane (bottom left) select Achieved FLOPS.

Nsight VSE Achieved FLOPS Results

nvprof

Metrics Available (on a K20)

nvprof --query-metrics | grep flop
flops_sp:            Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flops_sp_add:        Number of single-precision floating-point add operations executed by non-predicated threads
flops_sp_mul:        Number of single-precision floating-point multiply operations executed by non-predicated threads
flops_sp_fma:        Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_dp:            Number of double-precision floating-point operations executed non-predicated threads (add, multiply, multiply-accumulate and special)
flops_dp_add:        Number of double-precision floating-point add operations executed by non-predicated threads
flops_dp_mul:        Number of double-precision floating-point multiply operations executed by non-predicated threads
flops_dp_fma:        Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_sp_special:    Number of single-precision floating-point special operations executed by non-predicated threads
flop_sp_efficiency:  Ratio of achieved to peak single-precision floating-point operations
flop_dp_efficiency:  Ratio of achieved to peak double-precision floating-point operations

Collection and Results

nvprof --devices 0 --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
==2452== NVPROF is profiling process 2452, command: matrixMul.exe
GPU Device 0: "Tesla K20c" with compute capability 3.5

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 6.18 GFlop/s, Time= 21.196 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==2452== Profiling application: matrixMul.exe
==2452== Profiling result:
==2452== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K20c (0)"
        Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
        301                                  flops_sp                             FLOPS(Single)   131072000   131072000   131072000
        301                              flops_sp_add                         FLOPS(Single Add)           0           0           0
        301                              flops_sp_mul                         FLOPS(Single Mul)           0           0           0
        301                              flops_sp_fma                         FLOPS(Single FMA)    65536000    65536000    65536000

NOTE: flops_sp = flops_sp_add + flops_sp_mul + flops_sp_special + (2 * flops_sp_fma) (approximately)
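
As a cross-check on the run above: 2 × flops_sp_fma = 2 × 65,536,000 = 131,072,000 = flops_sp, consistent with the formula, and that figure equals the model count 2 × 320 × 320 × 640 for MatrixA(320,320) × MatrixB(640,320). Dividing by the reported time, 131,072,000 ops / 21.196 msec ≈ 6.18 GFlop/s, which is exactly the Performance figure the sample prints.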

Visual Profiler

The Visual Profiler supports the metrics shown in the nvprof section above.
