循环展开时出现“资源不足”错误

发布于 2024-12-07 04:02:38 字数 3532 浏览 3 评论 0原文

当我将内核中的展开循环从 8 个增加到 9 个时,它会因 资源不足 错误而中断。

我读到 如何我是否诊断由于资源不足而导致 CUDA 启动失败? 参数不匹配和寄存器过度使用可能是一个问题,但这里似乎不是这种情况。

我的内核计算 n 个点和 m 个质心之间的距离,并为每个点选择最近的质心。它适用于 8 维,但不适用于 9 维。当我设置 dimensions=9 并取消注释距离计算的两行时,我得到一个 pycuda._driver.LaunchError: cuLaunchGrid failed: launch out资源

您认为什么可能导致这种行为?还有哪些其他原因会导致资源不足*?

我使用的是 Quadro FX580。这是最小的例子。为了在实际代码中展开,我使用模板。

import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import pycuda.autoinit


## preference
np.random.seed(20)
points = 512
dimensions = 8
nclusters = 1

## init data
data = np.random.randn(points,dimensions).astype(np.float32)
clusters = data[:nclusters]

## init cuda
kernel_code = """

      // the kernel definition 
    __device__ __constant__ float centroids[16384];

    __global__ void kmeans_kernel(float *idata,float *g_centroids,
    int * cluster, float *min_dist, int numClusters, int numDim) {
    int valindex = blockIdx.x * blockDim.x + threadIdx.x ;
    float increased_distance,distance, minDistance;
    minDistance = 10000000 ;
    int nearestCentroid = 0;
    for(int k=0;k<numClusters;k++){
      distance = 0.0;
      increased_distance = idata[valindex*numDim] -centroids[k*numDim];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+1] -centroids[k*numDim+1];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+2] -centroids[k*numDim+2];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+3] -centroids[k*numDim+3];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+4] -centroids[k*numDim+4];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+5] -centroids[k*numDim+5];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+6] -centroids[k*numDim+6];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+7] -centroids[k*numDim+7];
      distance = distance +(increased_distance * increased_distance);
      //increased_distance =  idata[valindex*numDim+8] -centroids[k*numDim+8];
      //distance = distance +(increased_distance * increased_distance);

      if(distance <minDistance) {
        minDistance = distance ;
        nearestCentroid = k;
        } 
      }
      cluster[valindex]=nearestCentroid;
      min_dist[valindex]=sqrt(minDistance);
    } 
 """
mod = compiler.SourceModule(kernel_code)
centroids_adrs = mod.get_global('centroids')[0]    
kmeans_kernel = mod.get_function("kmeans_kernel")
clusters_gpu = gpuarray.to_gpu(clusters)
cluster = gpuarray.zeros(points, dtype=np.int32)
min_dist = gpuarray.zeros(points, dtype=np.float32)

driver.memcpy_htod(centroids_adrs,clusters)

distortion = gpuarray.zeros(points, dtype=np.float32)
block_size= 512

## start kernel
kmeans_kernel(
    driver.In(data),driver.In(clusters),cluster,min_dist,
    np.int32(nclusters),np.int32(dimensions),
    grid = (points/block_size,1),
    block = (block_size, 1, 1),
)
print cluster
print min_dist

When I increase the unrolling from 8 to 9 loops in my kernel, it breaks with an out of resources error.

I read in How do I diagnose a CUDA launch failure due to being out of resources? that a mismatch of parameters and an overuse of registers could be a problem, but that seems not be the case here.

My kernel calculates the distance between n points and m centroids and selects for each point the closest centroid. It works for 8 dimensions but not for 9. When I set dimensions=9 and uncomment the two lines for the distance calculation, I get an pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources.

What do you think, could cause this behavior? What other iusses can cause an out of resources*?

I use an Quadro FX580. Here is the minimal(ish) example. For the unrolling in the real code I use templates.

import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import pycuda.autoinit


## preference
np.random.seed(20)
points = 512
dimensions = 8
nclusters = 1

## init data
data = np.random.randn(points,dimensions).astype(np.float32)
clusters = data[:nclusters]

## init cuda
kernel_code = """

      // the kernel definition 
    __device__ __constant__ float centroids[16384];

    __global__ void kmeans_kernel(float *idata,float *g_centroids,
    int * cluster, float *min_dist, int numClusters, int numDim) {
    int valindex = blockIdx.x * blockDim.x + threadIdx.x ;
    float increased_distance,distance, minDistance;
    minDistance = 10000000 ;
    int nearestCentroid = 0;
    for(int k=0;k<numClusters;k++){
      distance = 0.0;
      increased_distance = idata[valindex*numDim] -centroids[k*numDim];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+1] -centroids[k*numDim+1];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+2] -centroids[k*numDim+2];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+3] -centroids[k*numDim+3];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+4] -centroids[k*numDim+4];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+5] -centroids[k*numDim+5];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+6] -centroids[k*numDim+6];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+7] -centroids[k*numDim+7];
      distance = distance +(increased_distance * increased_distance);
      //increased_distance =  idata[valindex*numDim+8] -centroids[k*numDim+8];
      //distance = distance +(increased_distance * increased_distance);

      if(distance <minDistance) {
        minDistance = distance ;
        nearestCentroid = k;
        } 
      }
      cluster[valindex]=nearestCentroid;
      min_dist[valindex]=sqrt(minDistance);
    } 
 """
mod = compiler.SourceModule(kernel_code)
centroids_adrs = mod.get_global('centroids')[0]    
kmeans_kernel = mod.get_function("kmeans_kernel")
clusters_gpu = gpuarray.to_gpu(clusters)
cluster = gpuarray.zeros(points, dtype=np.int32)
min_dist = gpuarray.zeros(points, dtype=np.float32)

driver.memcpy_htod(centroids_adrs,clusters)

distortion = gpuarray.zeros(points, dtype=np.float32)
block_size= 512

## start kernel
kmeans_kernel(
    driver.In(data),driver.In(clusters),cluster,min_dist,
    np.int32(nclusters),np.int32(dimensions),
    grid = (points/block_size,1),
    block = (block_size, 1, 1),
)
print cluster
print min_dist

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

平定天下 2024-12-14 04:02:38

由于您的 block_size (512) 太大,您的寄存器已用完。

ptxas 通过注释行报告您的内核使用 16 个寄存器:

$ nvcc test.cu -Xptxas --verbose
ptxas info    : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info    : Used 16 registers, 24+16 bytes smem, 65536 bytes cmem[0]

取消注释这些行会将寄存器使用增加到 17 个,并在运行时出现错误:

$ nvcc test.cu -run -Xptxas --verbose
ptxas info    : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info    : Used 17 registers, 24+16 bytes smem, 65536 bytes cmem[0]
error: too many resources requested for launch

内核每个线程使用的物理寄存器数量限制了您可以在运行时启动的块。 SM 1.0 设备有 8K 个寄存器可供线程块使用。我们可以将其与内核的寄存器需求进行比较:17 * 512 = 8704 > 8K。在 16 个寄存器处,您原来的注释内核只是吱吱作响:16 * 512 = 8192 == 8K

当未指定架构时,nvcc 默认为 SM 1.0 设备编译内核。 PyCUDA 可能以同样的方式工作。

要解决您的问题,您可以减小 block_size (例如,256)或找到一种方法来配置 PyCUDA 来为 SM 2.0 设备编译内核。 SM 2.0 设备(例如 QuadroFX 580)提供 32K 寄存器,对于您的原始 block_size 512 来说绰绰有余。

You're running out of registers because your block_size (512) is too large.

ptxas reports that your kernel uses 16 registers with the commented lines:

$ nvcc test.cu -Xptxas --verbose
ptxas info    : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info    : Used 16 registers, 24+16 bytes smem, 65536 bytes cmem[0]

Uncommenting the lines increases register use to 17 and an error at runtime:

$ nvcc test.cu -run -Xptxas --verbose
ptxas info    : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info    : Used 17 registers, 24+16 bytes smem, 65536 bytes cmem[0]
error: too many resources requested for launch

The number of physical registers used by each thread of a kernel limits the size of blocks you can launch at runtime. An SM 1.0 device has 8K registers that can be used by a block of threads. We can compare that to your kernel's register demands: 17 * 512 = 8704 > 8K. At 16 registers, your original commented kernel just squeaks by: 16 * 512 = 8192 == 8K.

When no architecture is specified, nvcc compiles kernels for an SM 1.0 device by default. PyCUDA may work the same way.

To fix your problem, you could either decrease block_size (to say, 256) or find a way to configure PyCUDA to compile your kernel for an SM 2.0 device. SM 2.0 devices such as your QuadroFX 580 provide 32K registers, more than enough for your original block_size of 512.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文