程序启动时堆栈溢出异常(CUDA Monte Carlo Pi)
我的问题是,当程序首次进入 main 时,我在程序启动时收到堆栈溢出异常。我的程序是一个使用 CUDA 的并行蒙特卡罗 Pi 计算器。当我尝试在 Visual Studio 中调试程序时,在我可以选择的任何断点之前都会弹出异常。任何帮助表示赞赏。
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 512
#define NUM_BLOCK 65534
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
extern __shared__ int sdata[];
// unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int k, incircle;
float x, y, z;
incircle = 0;
curand_init(1234, i, 0, &states[i]);
for(k = 0; k < trials; k++){
x = curand_uniform(&states[i]);
y = curand_uniform(&states[i]);
z = sqrt(x*x + y*y);
if (z <= 1) incircle++;
else{}
}
__syncthreads();
g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];
int trials, total;
curandState *devStates;
trials = 100;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
cudaMalloc((void **) &sumDev, size); // Allocate array on device
cudaMalloc((void **) &devStates, size*sizeof(curandState));
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock, size>>> (sumDev, trials, devStates);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, size>>> (sumDev);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, size, cudaMemcpyDeviceToHost);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
//*solution = NULL;
return 0;
}
My problem is that I am receiving a stack overflow exception at program start when the program first enters main. My program is a Parallel Monte Carlo Pi calculator using CUDA. When I try and debug the program in Visual Studio, the exception pops up before any breakpoint I can select. Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 512
#define NUM_BLOCK 65534
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
extern __shared__ int sdata[];
// unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int k, incircle;
float x, y, z;
incircle = 0;
curand_init(1234, i, 0, &states[i]);
for(k = 0; k < trials; k++){
x = curand_uniform(&states[i]);
y = curand_uniform(&states[i]);
z = sqrt(x*x + y*y);
if (z <= 1) incircle++;
else{}
}
__syncthreads();
g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];
int trials, total;
curandState *devStates;
trials = 100;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
cudaMalloc((void **) &sumDev, size); // Allocate array on device
cudaMalloc((void **) &devStates, size*sizeof(curandState));
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock, size>>> (sumDev, trials, devStates);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, size>>> (sumDev);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, size, cudaMemcpyDeviceToHost);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
//*solution = NULL;
return 0;
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我假设问题是这样的:
for
这给你留下了一个大约 130Mb 的静态声明数组。我怀疑编译器运行时库能否处理如此大的静态分配,这就是为什么你会立即出现堆栈溢出。将其替换为动态分配,堆栈溢出问题就会消失。但请阅读 Pavan 的帖子 小心,因为一旦修复了堆栈溢出,CUDA 代码本身也需要重新设计才能工作。
I would assume the problem is this:
for
That leaves you with a roughly 130Mb statically declared array. I doubt the compiler runtime library can deal with such a large static allocation, which is why you get an instant stack overflow. Replace it with a dynamic allocation and the stack overflow problem will go away. But then read Pavan's post carefully, because once you fix the stack overflow, the CUDA code itself also needs some redesign before it will work.
您正在声明共享内存的大小 = size;就像这里
size 的值 = 512 * 65534 * 4 = 2^9 * 2^16 * 2^2 = 2^27 (超过我能想到的任何卡上共享内存的最大值)。
但是看看你的内核,我认为你希望共享内存等于你拥有的线程数。
所以你要么需要做
1)
这是为了启动你的内核
2)
或者用它来启动你的内核,
并用它来声明你的内核中的共享内存。
对于此类内核,我个人更喜欢方法二,因为共享内存与线程数量成正比,但已知线程数量是恒定的。它也稍微快一些。
编辑
除了上述问题之外,我怀疑这也可能会导致问题。
因为尺寸本身就是这个。
也许您想这样做?
至于实际的堆栈溢出问题,您可能需要查看 talonmies 帖子。
You are declaring the size of shared memory = size; like here
The value of size = 512 * 65534 * 4 = 2^9 * 2^16 * 2^2 = 2^27 (more than the maximum value of shared memory on any card I can think of).
But looking at your kernels, I think you want the shared memory to be equal to the number of threads you have.
So you either need to do
1)
this for launching your kernels
2)
Or use this for launching your kernels
And this to declare your shared memory inside your kernel.
I personally prefer method two for these kinds of kernels because the shared memory is proportional to the number of threads, but the number of threads is known to be constant. It is also slightly faster.
EDIT
Apart from the forementioned problems I doubt that this might be causing problems too.
Becuase size itself is this.
May be you wanted to do this instead ?
As for the actual stack overflow problem you may want to look at talonmies post.