Does Matlab leak Cuda memory because of CUcontext caching?

Posted on 2024-11-29 13:11:27

Is using cudaDeviceReset() after computations the normal way to use the GPU from Matlab? I can't use the GPU computation in the latest version of Matlab because my GPU doesn't support Compute Capability 1.3+, and I don't want to pay tons of money to Accelereyes Jacket for using a simple Cuda function like cudaMemGetInfo() or my simple Cuda kernels.

I've found some very frustrating behavior when calling Cuda from Matlab. In Visual Studio 2008, I wrote a trivial DLL which uses the standard MEX interface to run one Cuda query: how much RAM is free on the device (Listing 1).

// cudaMemoryCheck.cpp : Defines the exported functions for the DLL application.

#include <mex.h>
#include <cuda.h>
#include <driver_types.h>
#include <cuda_runtime_api.h>

// MEX entry point: prints the free and total memory of the current Cuda device.
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    size_t free = 0, total = 0;
    // Note: the return code is stored but never checked.
    cudaError_t result = cudaMemGetInfo(&free, &total);

    // %u matches size_t only in a 32-bit build such as this one.
    mexPrintf("free memory in bytes %u (%u MB), total memory in bytes %u (%u MB). ", free, free/1024/1024, total, total/1024/1024);

    // Skip the percentage when total is 0 to avoid dividing by zero.
    if( total > 0 )
        mexPrintf("%2.2f%% free\n", (100.0*free)/total );
    else
        mexPrintf("\n");

    // this is the critical line!
    cudaDeviceReset();
}
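
As a side note on the format string in the listing above: `%u` only lines up with `size_t` in a 32-bit build. A minimal sketch of a portable alternative follows, assuming a C++11 compiler. The helper name `formatMemoryReport` is hypothetical and not part of the original MEX file. It builds the same report text as a string (without the trailing newline), which could then be passed to `mexPrintf("%s\n", ...)`.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: builds the same report line as the MEX listing, but
// widens the values to unsigned long long so "%llu" is correct on any platform.
std::string formatMemoryReport(std::size_t freeBytes, std::size_t totalBytes)
{
    const unsigned long long freeB = freeBytes;
    const unsigned long long totalB = totalBytes;
    const unsigned long long bytesPerMB = 1024ULL * 1024ULL;

    char line[256];
    std::snprintf(line, sizeof line,
                  "free memory in bytes %llu (%llu MB), total memory in bytes %llu (%llu MB). ",
                  freeB, freeB / bytesPerMB, totalB, totalB / bytesPerMB);
    std::string report(line);

    // A total of 0 means the query produced nothing usable, so skip the
    // percentage instead of dividing by zero.
    if (totalB > 0) {
        char percent[32];
        std::snprintf(percent, sizeof percent, "%2.2f%% free",
                      100.0 * static_cast<double>(freeB) / static_cast<double>(totalB));
        report += percent;
    }
    return report;
}
```

Integer division truncates the megabyte figures, which is why 57393152 bytes is reported as 54 MB rather than 55 MB.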

I compile the project as a Win32 DLL (release mode), export mexFunction using a DEF file, and rename the DLL file extension to .mexw32.

When I run cudaMemoryCheck from Matlab, I find that my GPU leaks memory if the cudaDeviceReset() call is commented out. Here's my trivial Matlab code (Listing 2):

addpath('C:\Users\admin\Documents\Visual Studio 2008\Projects\cudaMemoryCheck\Release')

for i=1:20
    clear mex
    cudaMemoryCheck;
end

Running this loop in Matlab with the cudaDeviceReset() call in place, I see:

free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free

The output from Matlab is very different when cudaDeviceReset() is commented out:

free memory in bytes 37019648 (35 MB), total memory in bytes 244776960 (233 MB). 15.12% free
free memory in bytes 25092096 (23 MB), total memory in bytes 244776960 (233 MB). 10.25% free
free memory in bytes 13549568 (12 MB), total memory in bytes 244776960 (233 MB). 5.54% free
free memory in bytes 12107776 (11 MB), total memory in bytes 244776960 (233 MB). 4.95% free
free memory in bytes 8568832 (8 MB), total memory in bytes 244776960 (233 MB). 3.50% free
free memory in bytes 9617408 (9 MB), total memory in bytes 244776960 (233 MB). 3.93% free
free memory in bytes 6078464 (5 MB), total memory in bytes 244776960 (233 MB). 2.48% free
free memory in bytes 8044544 (7 MB), total memory in bytes 244776960 (233 MB). 3.29% free
free memory in bytes 5816320 (5 MB), total memory in bytes 244776960 (233 MB). 2.38% free
free memory in bytes 7520256 (7 MB), total memory in bytes 244776960 (233 MB). 3.07% free
free memory in bytes 8830976 (8 MB), total memory in bytes 244776960 (233 MB). 3.61% free
free memory in bytes 5292032 (5 MB), total memory in bytes 244776960 (233 MB). 2.16% free
free memory in bytes 3407872 (3 MB), total memory in bytes 244776960 (233 MB). 1.39% free
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB). 
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB). 
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB). 
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB). 
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB). 
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB). 
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB). 

So I've concluded that even though my MEX function allocates no memory on the GPU, the Cuda Runtime API is creating new CUcontexts every time the MEX function runs, and it never clears them until I close Matlab or I use cudaDeviceReset(). Eventually the GPU runs out of memory despite the fact that I did not allocate anything on it!
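
To put a number on the drop per call, the free-memory readings from the log above can be differenced. This is a minimal sketch, and the helper name `freeMemoryDrops` is hypothetical. It only post-processes logged values and makes no Cuda calls.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the difference between consecutive
// free-memory readings, in bytes. A positive entry means free memory
// shrank between two calls; a negative entry means it grew.
std::vector<long long> freeMemoryDrops(const std::vector<long long>& freeBytes)
{
    std::vector<long long> drops;
    for (std::size_t i = 1; i < freeBytes.size(); ++i)
        drops.push_back(freeBytes[i - 1] - freeBytes[i]);
    return drops;
}
```

Applied to the first four readings without cudaDeviceReset() (37019648, 25092096, 13549568, 12107776), this gives drops of 11927552, 11542528, and 1441792 bytes. So the first two calls each cost roughly 11 MB, after which the readings shrink more slowly and fluctuate until the query starts reporting 0.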

I do not like using cudaDeviceReset(). The API says, "The function cudaDeviceReset() will deinitialize the primary context for the calling thread's current device immediately" and "It is the caller's responsibility to ensure that the device is not being accessed by any other host threads from the process when this function is called." In other words, using cudaDeviceReset() could terminate other GPU calculations immediately and without warning. I have not found any documentation that using cudaDeviceReset() frequently is normal, so I don't want to do it. I will accept any answer here that proves that using cudaDeviceReset() is normal and required.

Version info: NVIDIA GPU Computing Toolkit 4.0, Matlab 7.8.0 (R2009a, 32-bit), Windows 7 Enterprise SP1 (64-bit), Nvidia Quadro NVS 420 (latest Nvidia drivers, 270.81).

I can also reproduce this problem on Windows XP (32-bit, SP3) with a GeForce 8400 GS, same Matlab, Visual Studio, and GPU Computing Toolkit.

Output of deviceQuery.exe:

deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Quadro NVS 420"
  CUDA Driver Version / Runtime Version          4.0 / 4.0
  CUDA Capability Major/Minor version number:    1.1
  Total amount of global memory:                 233 MBytes (244776960 bytes)
  ( 1) Multiprocessors x ( 8) CUDA Cores/MP:     8 CUDA Cores
  GPU Clock Speed:                               1.40 GHz
  Memory Clock rate:                             700.00 Mhz
  Memory Bus Width:                              64-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and execution:                 No with 0 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   No
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro NVS 420"
  CUDA Driver Version / Runtime Version          4.0 / 4.0
  CUDA Capability Major/Minor version number:    1.1
  Total amount of global memory:                 234 MBytes (244908032 bytes)
  ( 1) Multiprocessors x ( 8) CUDA Cores/MP:     8 CUDA Cores
  GPU Clock Speed:                               1.40 GHz
  Memory Clock rate:                             700.00 Mhz
  Memory Bus Width:                              64-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and execution:                 No with 0 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   No
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Quadro NVS 420, Device = Quadro NVS 420

Comments (1)

渔村楼浪 2024-12-06 13:11:27

I don't think you should need to use cudaDeviceReset. What happens if you omit the call to clear mex? Why are you doing that in the first place? It causes MATLAB to unload your MEX file, and I suspect that is at the root of the memory leak.
