What is cudaLaunchKernel in the PyTorch profiler output?
I'm trying to profile my PyTorch network to see where the bottleneck is. I noticed that an operation called cudaLaunchKernel is taking up most of the time. This answer says that it is called for every operation done with CUDA. If I implemented this network in C++ or any other language, would it be possible to reduce this time?
Basically, I'm asking whether this overhead exists because I've implemented my network in Python, or whether it will always be there and cannot be optimized away in any language.
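For context, a table like the one below can be produced with torch.profiler roughly as follows (a minimal sketch: my actual network is omitted here, so the small conv stack and the input shape are just stand-ins):

import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in network; the real model (convs, relu, upsampling, cat) is omitted.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
).cuda().eval()
x = torch.randn(1, 3, 256, 256, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):  # appears as a row in the table
        with torch.no_grad():
            model(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))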
Full profiler output:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaLaunchKernel 99.80% 933.739ms 99.80% 933.739ms 20.750ms 0.000us 0.00% 0.000us 0.000us 45
model_inference 0.05% 453.000us 100.00% 935.567ms 935.567ms 0.000us 0.00% 195.000us 195.000us 1
aten::cudnn_convolution 0.04% 388.000us 99.84% 934.047ms 103.783ms 195.000us 100.00% 195.000us 21.667us 9
aten::_convolution 0.01% 138.000us 99.88% 934.419ms 103.824ms 0.000us 0.00% 195.000us 21.667us 9
aten::conv2d 0.01% 122.000us 99.89% 934.592ms 103.844ms 0.000us 0.00% 195.000us 21.667us 9
aten::add_ 0.01% 112.000us 0.02% 155.000us 17.222us 0.000us 0.00% 0.000us 0.000us 9
aten::upsample_nearest2d 0.01% 82.000us 0.01% 105.000us 26.250us 0.000us 0.00% 0.000us 0.000us 4
aten::empty 0.01% 79.000us 0.01% 79.000us 3.292us 0.000us 0.00% 0.000us 0.000us 24
aten::threshold 0.01% 74.000us 0.02% 149.000us 18.625us 0.000us 0.00% 0.000us 0.000us 8
aten::_cat 0.01% 71.000us 0.01% 119.000us 29.750us 0.000us 0.00% 0.000us 0.000us 4
aten::relu 0.01% 57.000us 0.02% 206.000us 25.750us 0.000us 0.00% 0.000us 0.000us 8
aten::convolution 0.01% 51.000us 99.88% 934.470ms 103.830ms 0.000us 0.00% 195.000us 21.667us 9
aten::view 0.01% 50.000us 0.01% 50.000us 5.556us 0.000us 0.00% 0.000us 0.000us 9
aten::cat 0.00% 32.000us 0.02% 151.000us 37.750us 0.000us 0.00% 0.000us 0.000us 4
aten::reshape 0.00% 29.000us 0.01% 79.000us 8.778us 0.000us 0.00% 0.000us 0.000us 9
aten::resize_ 0.00% 25.000us 0.00% 25.000us 0.962us 0.000us 0.00% 0.000us 0.000us 26
aten::rsub 0.00% 21.000us 0.00% 33.000us 33.000us 0.000us 0.00% 0.000us 0.000us 1
aten::mul 0.00% 17.000us 0.00% 27.000us 27.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zeros 0.00% 13.000us 0.00% 16.000us 16.000us 0.000us 0.00% 0.000us 0.000us 1
cudaEventRecord 0.00% 12.000us 0.00% 12.000us 1.333us 0.000us 0.00% 0.000us 0.000us 9
cudaBindTexture 0.00% 11.000us 0.00% 11.000us 2.750us 0.000us 0.00% 0.000us 0.000us 4
aten::empty_strided 0.00% 6.000us 0.00% 6.000us 6.000us 0.000us 0.00% 0.000us 0.000us 1
aten::zero_ 0.00% 1.000us 0.00% 1.000us 1.000us 0.000us 0.00% 0.000us 0.000us 1
cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::ma... 0.00% 0.000us 0.00% 0.000us 0.000us 195.000us 100.00% 195.000us 195.000us 1
cudaUnbindTexture 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 4
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 935.583ms
Self CUDA time total: 195.000us
PS: Some configs
Python version: 3.8.8
PyTorch version: 1.8.1
cudatoolkit version: 10.2.89
CUDA version (as given by nvidia-smi): 11.4
CPU specs: Intel Core i7-10700 @ 2.90GHz, 16 cores
GPU specs: NVIDIA GM204GL [Quadro M4000]
RAM: 64GB
GPU RAM: 8GB
OS: 64-bit Ubuntu 20.04.3
PPS: I'm not looking for ways to speed up my code. I want to know whether it could be sped up by coding it in a different language like C++ or directly in CUDA. (For example, if all my data were already on the GPU and I had written the code in CUDA itself, would it run in 195us?)
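(One way I could check this empirically, sketched below reusing model and x from the snippet above: time a second forward pass after a warm-up run, since the first CUDA calls typically absorb one-time context/cuDNN initialization that a single profiled pass lumps into cudaLaunchKernel.)

import time
import torch

# Warm-up: the first forward pass pays one-time CUDA/cuDNN initialization.
with torch.no_grad():
    model(x)
torch.cuda.synchronize()

# Time a second, steady-state pass.
t0 = time.perf_counter()
with torch.no_grad():
    model(x)
torch.cuda.synchronize()  # wait for the GPU to finish before reading the clock
print(f"warm run: {(time.perf_counter() - t0) * 1e6:.0f} us")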
1 Answer
According to the CUDA docs, cudaLaunchKernel is called to launch a device function, which, in short, is code that runs on a GPU device. The profiler therefore shows that a lot of computation runs on the GPU (as you probably expected), and this requires the data structures to be transferred to the device. This may be the source of the bottleneck.
I don't usually develop in CUDA, but perhaps you can speed up the process by writing larger kernels that do more operations in CUDA, with fewer CPU/GPU transfers.
Have a look at this tutorial for more details.
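As a rough illustration of the "fewer, larger kernels" idea without writing raw CUDA (a sketch under the assumption that pointwise ops dominate, not a benchmark): TorchScript can fuse chains of pointwise operations into a single generated kernel, which reduces the number of cudaLaunchKernel calls.

import torch

def pointwise(x, y):
    # Eager mode launches one kernel per op here: mul, add, relu = 3 launches.
    return torch.relu(x * 2.0 + y)

# Scripting lets the JIT fuse adjacent pointwise ops into one kernel
# (exact fusion behavior depends on the PyTorch version and the active fuser).
pointwise_scripted = torch.jit.script(pointwise)

x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")
pointwise_scripted(x, y)  # first call specializes and compiles
pointwise_scripted(x, y)  # later calls reuse the fused kernel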