CUDA kernel code in device memory: can it be explicitly managed?

Published 2024-12-01 04:04:18


Context: CUDA 4.0, Linux 64-bit, NVIDIA UNIX x86_64 Kernel Module 270.41.19, on a GeForce GTX 480.

I am trying to find a (device) memory leak in my program. I use the runtime API call cudaMemGetInfo(free, total) to measure device memory usage, and I notice a significant loss (in this case 31 MB) after kernel execution. The kernel code itself does not allocate any device memory, so I guess it is the kernel code that remains in device memory, even though I would have thought the kernel isn't that big. (Is there a way to determine the size of a kernel?)

When is the kernel code loaded into device memory? I would guess it happens when this host code line executes:

kernel<<<geom>>>(params);

Right?
And does the code remain in device memory after the call? If so, can I explicitly unload the code?
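
A minimal sketch of the measurement described above (noop_kernel and report_mem are illustrative names, not from the actual program):

#include <cstdio>
#include <cuda_runtime.h>

// A trivial kernel that allocates no device memory itself.
__global__ void noop_kernel() {}

// Print free/total device memory via the runtime API.
static void report_mem(const char *tag)
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    printf("%s: free %zu MB / total %zu MB\n", tag, free_b >> 20, total_b >> 20);
}

int main()
{
    report_mem("before launch");
    noop_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    report_mem("after launch");   // the drop observed here is the "loss" in question
    return 0;
}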

What concerns me is device memory fragmentation. Think of a long sequence of alternating device memory allocations and kernel executions (of different kernels). After a while, device memory becomes quite scarce. Even if you free some memory, the kernel code remains, leaving only the space between the kernels free for new allocations. After a while this would result in severe memory fragmentation. Is this the way CUDA was designed?

Comments (1)

萌吟 2024-12-08 04:04:18


The memory allocation you are observing is used by the CUDA context. It doesn't only hold kernel code; it also holds any other static-scope device symbols, textures, per-thread scratch space for local memory, printf and heap, constant memory, as well as GPU memory required by the driver and the CUDA runtime itself. Most of this memory is only ever allocated once, when a binary module is loaded or PTX code is JIT-compiled by the driver. It is probably best to think of it as a fixed overhead rather than a leak. There is a 2-million-instruction limit on PTX code, and current hardware uses 32-bit words for instructions, so the memory footprint of even the largest permissible kernel code is small compared to the other global memory overheads it requires.
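
To separate that one-time overhead from a genuine leak, a common idiom (not part of this answer originally, just a conventional technique) is to force context creation up front, for example with a no-op cudaFree(0), and take the baseline cudaMemGetInfo reading only after that:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaFree(0);   // conventional no-op call that forces context creation now

    size_t base_free = 0, total = 0;
    cudaMemGetInfo(&base_free, &total);   // baseline after the fixed overhead is paid

    // ... run allocations and kernels here, then re-measure ...

    size_t now_free = 0;
    cudaMemGetInfo(&now_free, &total);
    printf("delta since baseline: %lld bytes\n",
           (long long)base_free - (long long)now_free);
    return 0;
}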

In recent versions of CUDA there is a runtime API call cudaDeviceSetLimit which permits some control over the amount of scratch space a given context can consume. Be aware that it is possible to set the limits to values which are lower than the device code requires, in which case runtime execution failures can result.
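
A short sketch of how those limits are adjusted; the sizes below are purely illustrative:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Illustrative values only; a limit set below what the device code
    // actually needs will surface later as runtime execution failures.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);   // 1 MB device-side printf FIFO
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 << 20);   // 8 MB device malloc heap
    cudaDeviceSetLimit(cudaLimitStackSize,      2 << 10);   // 2 KB stack per device thread

    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);     // read a limit back
    printf("device malloc heap limit: %zu bytes\n", heap);
    return 0;
}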
