Is cudaMalloc slower than cudaMemcpy?
I am working on code that needs to be time-efficient and am therefore using cuFFT, but when I try to compute the FFT of a very large data set in parallel it is slower than CPU FFTW. After timing every line with high-precision timing code, I found the reason: cudaMalloc takes around 0.983 s, while each of the remaining lines takes around 0.00xx s, which is what I expected.
I have gone through some of the related posts, but according to them
the main delay with GPUs is due to memory transfer, not memory allocation
and in one of the posts it was written that
the very first call to any of the CUDA library functions launches an initialisation subroutine
What is the actual reason for this delay? Or is it not normal to have such a delay in the execution of the code?
Thanks in Advance
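For illustration, a minimal sketch of the kind of per-line timing described above might look like the following (the data size, the single 1D cuFFT plan, and all variable names are assumptions for this example, not the asker's actual code). Note that the first CUDA runtime call in the process, here cudaMalloc, also pays the one-time context initialisation cost.

    // timing sketch (illustrative only)
    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int N = 1 << 24;                 // assumed data size
        cufftComplex *d_data = nullptr;

        auto t0 = std::chrono::high_resolution_clock::now();
        cudaMalloc(&d_data, N * sizeof(cufftComplex)); // first CUDA call: also triggers context init
        auto t1 = std::chrono::high_resolution_clock::now();

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);           // single-batch 1D plan (assumption)
        auto t2 = std::chrono::high_resolution_clock::now();

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaDeviceSynchronize();                       // wait for the FFT before stopping the clock
        auto t3 = std::chrono::high_resolution_clock::now();

        auto ms = [](auto a, auto b) {
            return std::chrono::duration<double, std::milli>(b - a).count();
        };
        printf("cudaMalloc : %.3f ms\n", ms(t0, t1));
        printf("cufftPlan1d: %.3f ms\n", ms(t1, t2));
        printf("exec+sync  : %.3f ms\n", ms(t2, t3));

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }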
Comments (2)
Is it possible that the large delay you are seeing (nearly 1 s) is due to driver initialisation? It seems rather long for a cudaMalloc. Also check that your driver is up to date.

The delay for the first kernel launch can be due to a number of factors (a short sketch follows this list):

1. Driver loading. This only applies if you are running on a Linux system without X; in that case the driver is loaded only when required and unloaded afterwards. Running nvidia-smi -pm 1 as root puts the driver in persistent mode and avoids such delays. Check out man nvidia-smi for details, and remember to add it to an init script, since the setting does not persist across a reboot.
2. JIT compilation. The second delay is compiling the PTX for the specific device architecture in your system. This is easily avoided by embedding the binary for your device architecture (or architectures, if you want to support several without compiling PTX) into the executable. See the CUDA C Programming Guide (available on the NVIDIA website); section 3.1.1.2 covers JIT compilation.
3. Context creation. This is unavoidable, but NVIDIA has gone to great effort to reduce the cost. Context creation involves copying the executable code to the device, copying any data objects, setting up the memory system, and so on.
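A minimal sketch of how the first two points might be handled in practice: force context creation with a throwaway call before any timed region, and build with native code for the target GPU so no PTX is JIT-compiled at start-up. The sm_70 architecture in the build line is only a placeholder assumption; substitute the compute capability of the actual device.

    // Build with a native binary embedded to avoid PTX JIT at start-up, e.g.:
    //   nvcc -gencode arch=compute_70,code=sm_70 warmup.cu -o warmup   (sm_70 is a placeholder)
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <chrono>

    int main() {
        // Warm-up: the first runtime call creates the CUDA context,
        // so pay that cost here, outside any timed region.
        cudaSetDevice(0);
        cudaFree(0);                     // cheap call that forces context creation

        // Now time an allocation on the already-initialised context.
        auto t0 = std::chrono::high_resolution_clock::now();
        void *d_buf = nullptr;
        cudaMalloc(&d_buf, 128 << 20);   // 128 MiB, arbitrary size for illustration
        auto t1 = std::chrono::high_resolution_clock::now();

        printf("cudaMalloc after warm-up: %.3f ms\n",
               std::chrono::duration<double, std::milli>(t1 - t0).count());

        cudaFree(d_buf);
        return 0;
    }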
It is understandable: nvcc embeds PTX code into the application binary, and that PTX has to be compiled to a native GPU binary by a JIT compiler, which accounts for the start-up delay. AFAIK cudaMalloc is not slower than cudaMemcpy.
It is also true that calls to cudaRegisterFatBinary and cudaRegisterFunction are inserted by nvcc into your code to register your kernels and their entry points with the runtime. I guess this is the initialization you are talking about.
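To check that claim on a particular machine, one rough way is to compare a post-warm-up cudaMalloc against a host-to-device cudaMemcpy of the same size; a sketch under that assumption (the 64 MiB buffer size is arbitrary):

    // allocation vs. host-to-device copy, after context warm-up (illustrative only)
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <chrono>
    #include <vector>

    int main() {
        const size_t bytes = 64 << 20;   // 64 MiB, arbitrary
        cudaFree(0);                     // absorb context initialization first

        auto t0 = std::chrono::high_resolution_clock::now();
        void *d_buf = nullptr;
        cudaMalloc(&d_buf, bytes);
        auto t1 = std::chrono::high_resolution_clock::now();

        std::vector<char> h_buf(bytes, 0);
        auto t2 = std::chrono::high_resolution_clock::now();
        cudaMemcpy(d_buf, h_buf.data(), bytes, cudaMemcpyHostToDevice);
        auto t3 = std::chrono::high_resolution_clock::now();

        auto ms = [](auto a, auto b) {
            return std::chrono::duration<double, std::milli>(b - a).count();
        };
        printf("cudaMalloc: %.3f ms   cudaMemcpy H2D: %.3f ms\n", ms(t0, t1), ms(t2, t3));

        cudaFree(d_buf);
        return 0;
    }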