"Out of resources" error when unrolling a loop
When I increase the unrolling from 8 to 9 loops in my kernel, it breaks with an out of resources error.

I read in How do I diagnose a CUDA launch failure due to being out of resources? that a mismatch of parameters and an overuse of registers could be a problem, but that does not seem to be the case here.

My kernel calculates the distance between n points and m centroids and selects the closest centroid for each point. It works for 8 dimensions but not for 9. When I set dimensions=9 and uncomment the two lines of the distance calculation, I get a pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources.

What do you think could cause this behavior? What other issues can cause an out of resources error?

I am using a Quadro FX 580. Here is a minimal(ish) example. For the unrolling in the real code I use templates.
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import pycuda.autoinit

## preference
np.random.seed(20)
points = 512
dimensions = 8
nclusters = 1

## init data
data = np.random.randn(points, dimensions).astype(np.float32)
clusters = data[:nclusters]

## init cuda
kernel_code = """
// the kernel definition
__device__ __constant__ float centroids[16384];

__global__ void kmeans_kernel(float *idata, float *g_centroids,
                              int *cluster, float *min_dist,
                              int numClusters, int numDim) {
    int valindex = blockIdx.x * blockDim.x + threadIdx.x;
    float increased_distance, distance, minDistance;
    minDistance = 10000000;
    int nearestCentroid = 0;
    for (int k = 0; k < numClusters; k++) {
        distance = 0.0;
        increased_distance = idata[valindex*numDim]   - centroids[k*numDim];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+1] - centroids[k*numDim+1];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+2] - centroids[k*numDim+2];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+3] - centroids[k*numDim+3];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+4] - centroids[k*numDim+4];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+5] - centroids[k*numDim+5];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+6] - centroids[k*numDim+6];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+7] - centroids[k*numDim+7];
        distance = distance + (increased_distance * increased_distance);
        //increased_distance = idata[valindex*numDim+8] - centroids[k*numDim+8];
        //distance = distance + (increased_distance * increased_distance);
        if (distance < minDistance) {
            minDistance = distance;
            nearestCentroid = k;
        }
    }
    cluster[valindex] = nearestCentroid;
    min_dist[valindex] = sqrt(minDistance);
}
"""
mod = compiler.SourceModule(kernel_code)
centroids_adrs = mod.get_global('centroids')[0]
kmeans_kernel = mod.get_function("kmeans_kernel")

clusters_gpu = gpuarray.to_gpu(clusters)
cluster = gpuarray.zeros(points, dtype=np.int32)
min_dist = gpuarray.zeros(points, dtype=np.float32)
driver.memcpy_htod(centroids_adrs, clusters)
distortion = gpuarray.zeros(points, dtype=np.float32)
block_size = 512

## start kernel
kmeans_kernel(
    driver.In(data), driver.In(clusters), cluster, min_dist,
    np.int32(nclusters), np.int32(dimensions),
    grid=(points / block_size, 1),
    block=(block_size, 1, 1),
)
print cluster
print min_dist
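As an aside, the hand-written unrolled distance lines above can be generated from a template, which is what the question says the real code does. A minimal sketch of that idea using plain Python string formatting (not the asker's actual template code; it reuses the dimensions variable from the script above):

# Generate the unrolled distance accumulation for an arbitrary number of
# dimensions; the result can be spliced into the kernel source before compiling.
line_template = (
    "        increased_distance = idata[valindex*numDim+%(d)d] - centroids[k*numDim+%(d)d];\n"
    "        distance = distance + (increased_distance * increased_distance);"
)
unrolled_body = "\n".join(line_template % {"d": d} for d in range(dimensions))
print(unrolled_body)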
Comments (1)
You're running out of registers because your block_size (512) is too large. ptxas reports that your kernel uses 16 registers with the extra lines commented out; uncommenting them increases register use to 17, which produces the error at runtime.
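If the ptxas output isn't handy, PyCUDA can report the same numbers at runtime. A minimal sketch, assuming the kmeans_kernel object from the script above; it relies on the driver function attributes that PyCUDA exposes on compiled kernels:

# Check register pressure from Python (not part of the original script).
print("registers per thread: %d" % kmeans_kernel.num_regs)
print("largest launchable block for this kernel: %d" % kmeans_kernel.max_threads_per_block)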
The number of physical registers used by each thread of a kernel limits the size of blocks you can launch at runtime. An SM 1.0 device has 8K registers that can be used by a block of threads. We can compare that to your kernel's register demands: 17 * 512 = 8704 > 8K. At 16 registers, your original commented kernel just squeaks by: 16 * 512 = 8192 == 8K.
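You can run the same arithmetic against the register limit your device actually reports rather than assuming 8K. A sketch, assuming the pycuda.autoinit context from the question's script; it ignores register allocation granularity, so the result is only an upper bound:

import pycuda.autoinit
from pycuda import driver

dev = pycuda.autoinit.device
regs_per_block = dev.get_attribute(driver.device_attribute.MAX_REGISTERS_PER_BLOCK)

# 16 regs/thread with the extra lines commented out, 17 with them enabled.
for regs_per_thread in (16, 17):
    max_threads = regs_per_block // regs_per_thread
    print("%d regs/thread -> at most %d threads per block" % (regs_per_thread, max_threads))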
When no architecture is specified, nvcc compiles kernels for an SM 1.0 device by default. PyCUDA may work the same way.

To fix your problem, you could either decrease block_size (to, say, 256) or find a way to configure PyCUDA to compile your kernel for an SM 2.0 device. SM 2.0 devices such as your Quadro FX 580 provide 32K registers, more than enough for your original block_size of 512.
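Both fixes are small changes to the question's script. A sketch, assuming the same variable names; passing arch="sm_20" to SourceModule is an assumption about how your nvcc toolchain is set up, so adjust the string if the compiler rejects it:

# Option 1: smaller blocks, so 17 regs/thread * 256 threads fits in an 8K register file.
block_size = 256
kmeans_kernel(
    driver.In(data), driver.In(clusters), cluster, min_dist,
    np.int32(nclusters), np.int32(dimensions),
    grid=(points // block_size, 1),
    block=(block_size, 1, 1),
)

# Option 2: compile for the SM 2.0 architecture so the larger register file is available.
mod = compiler.SourceModule(kernel_code, arch="sm_20")
kmeans_kernel = mod.get_function("kmeans_kernel")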