Stack overflow exception at program start (CUDA Monte Carlo Pi)



My problem is that I am receiving a stack overflow exception at program start, when the program first enters main. My program is a parallel Monte Carlo Pi calculator using CUDA. When I try to debug the program in Visual Studio, the exception pops up before any breakpoint I can set. Any help is appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>

#define NUM_THREAD 512
#define NUM_BLOCK 65534

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
    extern __shared__ int sdata[];

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_odata[i];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) { // step = s x 2
        if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int  trials, curandState *states){
    extern __shared__ int sdata[];
//  unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int k, incircle;
    float x, y, z;
    incircle = 0;

    curand_init(1234, i, 0, &states[i]);

    for (k = 0; k < trials; k++) {
        x = curand_uniform(&states[i]);
        y = curand_uniform(&states[i]);
        z = sqrt(x*x + y*y);
        if (z <= 1) incircle++;
    }
    __syncthreads();
    g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {

    float* solution = (float*)calloc(100, sizeof(float));
    float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];
    int trials, total; 
    curandState *devStates;



    trials = 100;
    total = trials*NUM_THREAD*NUM_BLOCK;

    dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
    dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
    size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
    cudaMalloc((void **) &sumDev, size); // Allocate array on device
    cudaMalloc((void **) &devStates, size*sizeof(curandState));
    // Do calculation on device by calling CUDA kernel
    monteCarlo <<<dimGrid, dimBlock, size>>> (sumDev, trials, devStates);
        // call reduction function to sum
    reduce0 <<<dimGrid, dimBlock, size>>> (sumDev);
    // Retrieve result from device and store it in host array
    cudaMemcpy(sumHost, sumDev, size, cudaMemcpyDeviceToHost);

    *solution = 4*(sumHost[0]/total);
    printf("%.*f\n", 1000, *solution);
    free (solution);
    //*solution = NULL;
    return 0;
}


Answers (2)

乖乖 2024-11-17 12:54:10


I would assume the problem is this:

float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];

given

#define NUM_THREAD 512
#define NUM_BLOCK 65534

That leaves you with a roughly 130 MB array declared as a local variable, which lands on the stack. I doubt the runtime can handle such a large stack allocation, which is why you get an instant stack overflow. Replace it with a dynamic allocation and the stack overflow problem will go away. But then read Pavan's post carefully, because once you fix the stack overflow, the CUDA code itself also needs some redesign before it will work.
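
For example, a minimal sketch of that fix (only the host-side buffer changes; the device allocations and kernel launches stay as in the question):

// Heap-allocate the host result buffer instead of declaring a ~130 MB
// local array, which would land on the stack and overflow it.
float *sumHost = (float *)malloc(NUM_BLOCK * NUM_THREAD * sizeof(float));
if (sumHost == NULL) {
    fprintf(stderr, "host allocation failed\n");
    return 1;
}

// ... kernel launches and cudaMemcpy(sumHost, sumDev, size, cudaMemcpyDeviceToHost) as before ...

free(sumHost);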

三生殊途 2024-11-17 12:54:10


You are declaring the size of the shared memory to be size, as here

monteCarlo <<<dimGrid, dimBlock, size>>>

The value of size = 512 * 65534 * 4 ≈ 2^9 * 2^16 * 2^2 = 2^27 bytes (more than the maximum amount of shared memory on any card I can think of).

But looking at your kernels, I think you want the shared memory to be equal to the number of threads per block.

So you either need to

1)
launch your kernels like this

monteCarlo <<<dimGrid, dimBlock, (NUM_THREAD * sizeof(int))>>>

2)
or launch your kernels like this

monteCarlo <<<dimGrid, dimBlock>>>

and declare the shared memory inside your kernel like this.

__shared__ int sdata[NUM_THREAD]; // Note: no extern before __shared__

I personally prefer method two for these kinds of kernels, because the shared memory is proportional to the number of threads per block, and that number is a compile-time constant here. It is also slightly faster.
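
As a sketch, method two applied to the reduction kernel from the question (assuming the NUM_THREAD macro from the question is in scope) would look roughly like this:

// Statically sized shared memory; no third launch parameter needed.
__global__ void reduce0(float *g_odata) {
    __shared__ int sdata[NUM_THREAD]; // one slot per thread, fixed at compile time

    unsigned int tid = threadIdx.x;
    sdata[tid] = g_odata[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // ... reduction loop unchanged ...
}

// launched without the dynamic shared-memory argument:
// reduce0 <<<dimGrid, dimBlock>>> (sumDev);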

EDIT

Apart from the aforementioned problems, I suspect this might be causing trouble too.

cudaMalloc((void **) &devStates, size*sizeof(curandState));

Because size itself is this.

size = NUM_BLOCK * NUM_THREAD * sizeof(float);

Maybe you wanted to do this instead?

cudaMalloc((void **) &devStates, (NUM_BLOCK * NUM_THREAD) * sizeof(curandState));

As for the actual stack overflow problem, you may want to look at talonmies' post.
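
Putting both corrections together, the host-side setup could look roughly like this (a sketch under the assumptions above, not a complete rewrite):

size_t size = NUM_BLOCK * NUM_THREAD * sizeof(float); // bytes for the per-thread results
cudaMalloc((void **) &sumDev, size);
// one curandState per thread, not size * sizeof(curandState) bytes
cudaMalloc((void **) &devStates, NUM_BLOCK * NUM_THREAD * sizeof(curandState));

// dynamic shared memory: one int per thread in a block, not the whole result array
monteCarlo <<<dimGrid, dimBlock, NUM_THREAD * sizeof(int)>>> (sumDev, trials, devStates);
reduce0 <<<dimGrid, dimBlock, NUM_THREAD * sizeof(int)>>> (sumDev);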
