Is Matlab leaking Cuda memory because of CUcontext caching?
Is using cudaDeviceReset() after computations the normal way to use the GPU from Matlab? I can't use the GPU computation in the latest version of Matlab because my GPU doesn't support Compute Capability 1.3+, and I don't want to pay tons of money to Accelereyes Jacket for using a simple Cuda function like cudaMemGetInfo() or my simple Cuda kernels.
I've found some very frustrating behavior when calling Cuda from Matlab. In Visual Studio 2008, I wrote a trivial DLL which uses the standard MEX interface to run one Cuda query: how much RAM is free on the device (Listing 1).
// cudaMemoryCheck.cpp : Defines the exported functions for the DLL application.
#include <mex.h>
#include <cuda.h>
#include <driver_types.h>
#include <cuda_runtime_api.h>

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    size_t free = 0, total = 0;
    cudaError_t result = cudaMemGetInfo(&free, &total);
    // size_t values are cast to unsigned so each argument matches its %u specifier
    mexPrintf("free memory in bytes %u (%u MB), total memory in bytes %u (%u MB). ",
              (unsigned)free, (unsigned)(free/1024/1024),
              (unsigned)total, (unsigned)(total/1024/1024));
    if( total > 0 )
        mexPrintf("%2.2f%% free\n", (100.0*free)/total );
    else
        mexPrintf("\n");
    // this is the critical line!
    cudaDeviceReset();
}
I compile the project as a Win32 DLL (release mode), export mexFunction using a DEF file, and rename the DLL file extension to .mexw32.
When I run cudaMemoryCheck from Matlab, I find that my GPU will leak memory if the cudaDeviceReset() is commented out. Here's my trivial Matlab code (Listing 2):
addpath('C:\Users\admin\Documents\Visual Studio 2008\Projects\cudaMemoryCheck\Release')
for i=1:20
    clear mex
    cudaMemoryCheck;
end
Running this loop in Matlab with cudaDeviceReset() in place, I see:
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
The output from Matlab is very different when cudaDeviceReset() is commented out:
free memory in bytes 37019648 (35 MB), total memory in bytes 244776960 (233 MB). 15.12% free
free memory in bytes 25092096 (23 MB), total memory in bytes 244776960 (233 MB). 10.25% free
free memory in bytes 13549568 (12 MB), total memory in bytes 244776960 (233 MB). 5.54% free
free memory in bytes 12107776 (11 MB), total memory in bytes 244776960 (233 MB). 4.95% free
free memory in bytes 8568832 (8 MB), total memory in bytes 244776960 (233 MB). 3.50% free
free memory in bytes 9617408 (9 MB), total memory in bytes 244776960 (233 MB). 3.93% free
free memory in bytes 6078464 (5 MB), total memory in bytes 244776960 (233 MB). 2.48% free
free memory in bytes 8044544 (7 MB), total memory in bytes 244776960 (233 MB). 3.29% free
free memory in bytes 5816320 (5 MB), total memory in bytes 244776960 (233 MB). 2.38% free
free memory in bytes 7520256 (7 MB), total memory in bytes 244776960 (233 MB). 3.07% free
free memory in bytes 8830976 (8 MB), total memory in bytes 244776960 (233 MB). 3.61% free
free memory in bytes 5292032 (5 MB), total memory in bytes 244776960 (233 MB). 2.16% free
free memory in bytes 3407872 (3 MB), total memory in bytes 244776960 (233 MB). 1.39% free
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
So I've concluded that even though my MEX function allocates no memory on the GPU, the Cuda Runtime API is creating new CUcontexts every time the MEX function runs, and it never clears them until I close Matlab or I use cudaDeviceReset(). Eventually the GPU runs out of memory despite the fact that I did not allocate anything on it!
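To put numbers on the leak, the free-memory readings from the run without cudaDeviceReset() can be turned into per-call differences. The following is only a minimal sketch in plain C++ (no CUDA needed); the function name per_call_loss is made up for illustration, and the inputs are meant to be the byte counts copied from the output above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: difference between consecutive free-memory readings.
// A positive entry means free memory dropped between two calls;
// a negative entry means some memory came back.
std::vector<long long> per_call_loss(const std::vector<long long>& free_bytes) {
    std::vector<long long> loss;
    for (std::size_t i = 1; i < free_bytes.size(); ++i) {
        loss.push_back(free_bytes[i - 1] - free_bytes[i]);
    }
    return loss;
}
```

Feeding it the first four readings (37019648, 25092096, 13549568, 12107776) gives 11927552, 11542528 and 1441792 bytes, so the first two calls each cost roughly 11 MB and the loss per call is not constant afterwards.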
I do not like using cudaDeviceReset(). The API says, "The function cudaDeviceReset() will deinitialize the primary context for the calling thread's current device immediately" and "It is the caller's responsibility to ensure that the device is not being accessed by any other host threads from the process when this function is called." In other words, using cudaDeviceReset() could terminate other GPU calculations immediately and without warning. I have not found any documentation that using cudaDeviceReset() frequently is normal, so I don't want to do it. I will accept any answer here that proves that using cudaDeviceReset() is normal and required.
Version info: NVIDIA GPU Computing Toolkit 4.0, Matlab 7.8.0 (R2009a, 32-bit), Windows 7 Enterprise SP1 (64-bit), Nvidia Quadro NVS 420 (latest Nvidia drivers, 270.81).
I can also reproduce this problem on Windows XP (32-bit, SP3) with a GeForce 8400 GS, same Matlab, Visual Studio, and GPU Computing Toolkit.
Output of deviceQuery.exe:
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Quadro NVS 420"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 233 MBytes (244776960 bytes)
( 1) Multiprocessors x ( 8) CUDA Cores/MP: 8 CUDA Cores
GPU Clock Speed: 1.40 GHz
Memory Clock rate: 700.00 Mhz
Memory Bus Width: 64-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro NVS 420"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 234 MBytes (244908032 bytes)
( 1) Multiprocessors x ( 8) CUDA Cores/MP: 8 CUDA Cores
GPU Clock Speed: 1.40 GHz
Memory Clock rate: 700.00 Mhz
Memory Bus Width: 64-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Quadro NVS 420, Device = Quadro NVS 420
Comments (1)
I don't think you should need to use cudaDeviceReset. What happens if you omit the call to clear mex? Why are you doing that in the first place? That will cause MATLAB to unload your MEX file, and I suspect that is at the root of the memory leak.
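One way to act on this suggestion, sketched under assumptions: keep the MEX file loaded (drop clear mex from the loop) and release the device once, when MATLAB unloads the file, instead of on every call. The class below is a hypothetical helper, not part of CUDA or the MEX API. In a real MEX file the release callback would presumably be cudaDeviceReset(), and shutdown() would be called from a function registered with mexAtExit. It also assumes a C++11 compiler, which Visual Studio 2008 is not.

```cpp
#include <functional>
#include <utility>

// Hypothetical helper: acquires a resource on first use and
// releases it at most once, no matter how often it is asked to.
class OnceResource {
public:
    OnceResource(std::function<void()> acquire, std::function<void()> release)
        : acquire_(std::move(acquire)), release_(std::move(release)), held_(false) {}

    // Intended to be called at the top of every mexFunction invocation;
    // only the first call actually runs the acquire callback.
    void ensure() {
        if (!held_) {
            acquire_();
            held_ = true;
        }
    }

    // Intended to be called from the exit handler; repeated calls are no-ops.
    void shutdown() {
        if (held_) {
            release_();
            held_ = false;
        }
    }

    bool held() const { return held_; }

private:
    std::function<void()> acquire_;
    std::function<void()> release_;
    bool held_;
};
```

With this shape, twenty calls in a row trigger one acquire and no release, and the single release happens only when the file is cleared or MATLAB exits. Whether this removes the leak on the hardware described in the question is untested here.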