What is cudaLaunchKernel in the PyTorch profiler output?
I'm trying to profile my PyTorch network to see where the bottleneck is. I noticed that an operation called cudaLaunchKernel is taking up most of the time. This answer says that it is called for every operation done with CUDA. If I implemented this network in C++ or any other language, would it be possible to reduce this time?
Basically, I'm asking whether this overhead exists because I've implemented my network in Python, or whether it will always be there and cannot be optimized away in any language.
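For context, a table like the one below can be produced with torch.profiler roughly as follows (a minimal sketch: my actual network is omitted here, so the small conv stack and the input shape are just stand-ins):

import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in network; the real model (convs, relu, upsampling, cat) is omitted.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
).cuda().eval()
x = torch.randn(1, 3, 256, 256, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):  # appears as a row in the table
        with torch.no_grad():
            model(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))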
Full profiler output:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaLaunchKernel 99.80% 933.739ms 99.80% 933.739ms 20.750ms 0.000us 0.00% 0.000us 0.000us 45
model_inference 0.05% 453.000us 100.00% 935.567ms 935.567ms 0.000us 0.00% 195.000us 195.000us 1
aten::cudnn_convolution 0.04% 388.000us 99.84% 934.047ms 103.783ms 195.000us 100.00% 195.000us 21.667us 9
aten::_convolution 0.01% 138.000us 99.88% 934.419ms 103.824ms 0.000us 0.00% 195.000us 21.667us 9
aten::conv2d 0.01% 122.000us 99.89% 934.592ms 103.844ms 0.000us 0.00% 195.000us 21.667us 9
aten::add_ 0.01% 112.000us 0.02% 155.000us 17.222us 0.000us 0.00% 0.000us 0.000us 9
aten::upsample_nearest2d 0.01% 82.000us 0.01% 105.000us 26.250us 0.000us 0.00% 0.000us 0.000us 4
aten::empty 0.01% 79.000us 0.01% 79.000us 3.292us 0.000us 0.00% 0.000us 0.000us 24
aten::threshold 0.01% 74.000us 0.02% 149.000us 18.625us 0.000us 0.00% 0.000us 0.000us 8
aten::_cat 0.01% 71.000us 0.01% 119.000us 29.750us 0.000us 0.00% 0.000us 0.000us 4
aten::relu 0.01% 57.000us 0.02% 206.000us 25.750us 0.000us 0.00% 0.000us 0.000us 8
aten::convolution 0.01% 51.000us 99.88% 934.470ms 103.830ms 0.000us 0.00% 195.000us 21.667us 9
aten::view 0.01% 50.000us 0.01% 50.000us 5.556us 0.000us 0.00% 0.000us 0.000us 9
aten::cat 0.00% 32.000us 0.02% 151.000us 37.750us 0.000us 0.00% 0.000us 0.000us 4
aten::reshape 0.00% 29.000us 0.01% 79.000us 8.778us 0.000us 0.00% 0.000us 0.000us 9
aten::resize_ 0.00% 25.000us 0.00% 25.000us 0.962us 0.000us 0.00% 0.000us 0.000us 26
aten::rsub 0.00% 21.000us 0.00% 33.000us 33.000us 0.000us 0.00% 0.000us 0.000us 1
aten::mul 0.00% 17.000us 0.00% 27.000us 27.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zeros 0.00% 13.000us 0.00% 16.000us 16.000us 0.000us 0.00% 0.000us 0.000us 1
cudaEventRecord 0.00% 12.000us 0.00% 12.000us 1.333us 0.000us 0.00% 0.000us 0.000us 9
cudaBindTexture 0.00% 11.000us 0.00% 11.000us 2.750us 0.000us 0.00% 0.000us 0.000us 4
aten::empty_strided 0.00% 6.000us 0.00% 6.000us 6.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zero_ 0.00% 1.000us 0.00% 1.000us 1.000us 0.000us 0.00% 0.000us 0.000us 1
cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::ma... 0.00% 0.000us 0.00% 0.000us 0.000us 195.000us 100.00% 195.000us 195.000us 1
cudaUnbindTexture 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 4
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 935.583ms
Self CUDA time total: 195.000us
PS: Some configs
Python version: 3.8.8
PyTorch version: 1.8.1
cudatoolkit version: 10.2.89
CUDA version (as given by nvidia-smi): 11.4
CPU specs: Intel Core i7-10700 @ 2.90GHz, 16 cores
GPU specs: NVIDIA GM204GL [Quadro M4000]
RAM: 64GB
GPU RAM: 8GB
OS: 64-bit Ubuntu 20.04.3
PPS: I'm not looking for ways to speed up my code. I want to know whether it could be sped up by coding it in a different language like C++ or directly in CUDA. (For example, if all my data were already on the GPU and I had written the code in CUDA itself, would it run in 195us?)
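(One way I could check this empirically, sketched below reusing model and x from the snippet above: time a second forward pass after a warm-up run, since the first CUDA calls typically absorb one-time context/cuDNN initialization that a single profiled pass lumps into cudaLaunchKernel.)

import time
import torch

# Warm-up: the first forward pass pays one-time CUDA/cuDNN initialization.
with torch.no_grad():
    model(x)
torch.cuda.synchronize()

# Time a second, steady-state pass.
t0 = time.perf_counter()
with torch.no_grad():
    model(x)
torch.cuda.synchronize()  # wait for the GPU to finish before reading the clock
print(f"warm run: {(time.perf_counter() - t0) * 1e6:.0f} us")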
1 Answer
According to the CUDA docs, cudaLaunchKernel is called to launch a device function, which, in short, is code that runs on a GPU device. The profiler therefore shows that a lot of computation runs on the GPU (as you probably expected), and this requires the data structures to be transferred to the device. This may be the source of the bottleneck.
I don't usually develop in CUDA, but perhaps you can speed up the process by writing larger kernels that do more operations in CUDA, with fewer CPU/GPU transfers.
Have a look at this tutorial for more details.
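As a rough illustration of the "fewer, larger kernels" idea without writing raw CUDA (a sketch under the assumption that pointwise ops dominate, not a benchmark): TorchScript can fuse chains of pointwise operations into a single generated kernel, which reduces the number of cudaLaunchKernel calls.

import torch

def pointwise(x, y):
    # Eager mode launches one kernel per op here: mul, add, relu = 3 launches.
    return torch.relu(x * 2.0 + y)

# Scripting lets the JIT fuse adjacent pointwise ops into one kernel
# (exact fusion behavior depends on the PyTorch version and the active fuser).
pointwise_scripted = torch.jit.script(pointwise)

x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")
pointwise_scripted(x, y)  # first call specializes and compiles
pointwise_scripted(x, y)  # later calls reuse the fused kernel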