CUDA convex hull program crashes on large input
I am trying to implement the QuickHull algorithm (for convex hulls) in parallel in CUDA. It works correctly for input_size <= 1 million. When I try 10 million points, the program crashes. My graphics card has 1982 MB of memory, and all the data structures in the algorithm collectively require no more than 600 MB for this input size, which is less than 50% of the available space.
By commenting out lines of my kernels, I found that the crash occurs when I try to access an array element, even though the index of the element I am trying to access is not out of bounds (double-checked). The following is the kernel code where it crashes.
// Scan this thread's slice of old_set for the point furthest from segment AB.
for (unsigned int i = old_setIndex; i < old_setIndex + old_setS[tid]; i++)
{
    int pI = old_set[i];
    // Guard against a corrupt index; note >= here, since pts.size()
    // itself is already one past the last valid index.
    if (pI <= -1 || pI >= pts.size())
    {
        printf("Thread %d: i = %u, pI = %d\n", tid, i, pI);
        continue;
    }
    p = pts[pI];
    double d = distance(A, B, p);
    if (d > dist) {
        dist = d;
        furthestPoint = i;  // index into old_set
        fpi = pI;           // index into pts
    }
}
//fpi = old_set[furthestPoint];
//printf("Thread %d: Furthestpoint = %d\n", tid, furthestPoint);
My code crashes when I uncomment the statements (the array access and the printf) after the for loop. I cannot explain the error, since furthestPoint is always within the bounds of the old_set array. old_setS stores the sizes of the smaller sub-arrays that each thread operates on. It crashes even if I just try to print the value of furthestPoint (the last line) without the array access statement above it.
There is no problem with the above code for input sizes <= 1 million. Am I overflowing some buffer on the device in the 10 million case?
Please help me find the source of the crash.
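(For reference, a minimal error-checking sketch; the CUDA_CHECK macro and the findFurthestPoint kernel name are illustrative, not from the original code. Wrapping every launch like this reports the actual runtime error instead of a silent crash.)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the real CUDA error and abort; cudaDeviceSynchronize() is what
// surfaces asynchronous kernel failures such as a launch timeout.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err));      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Usage after a launch (findFurthestPoint is a hypothetical kernel name):
// findFurthestPoint<<<blocks, threads>>>(...);
// CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // catches runtime errors, e.g. a timeout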
There is no out of bounds memory access in your code (or at least not one which is causing the symptoms you are seeing).
What is happening is that your kernel is being killed by the display driver because it is taking too long to execute on your display GPU. All CUDA platform display drivers enforce a time limit on any operation on the GPU. This exists to prevent the display from freezing long enough that either the OS kernel panics, or the user panics and thinks the machine has crashed. On the Windows platform you are using, the time limit is about 2 seconds.
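Whether the watchdog applies to a given GPU can be queried from the runtime; kernelExecTimeoutEnabled is a standard field of cudaDeviceProp. A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // kernelExecTimeoutEnabled is nonzero when the display driver's
    // watchdog timer applies to kernels on this GPU.
    printf("%s: watchdog %s\n", prop.name,
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return 0;
}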
What has partly misled you into thinking the source of the problem is array addressing is that commenting out code makes the problem disappear. What really happens there is an artifact of compiler optimization. When you comment out a global memory write, the compiler recognizes that the calculations which produce the stored value are unused, and it removes all of that code from the assembly it emits (google "nvcc dead code removal" for more information). That makes the code run much faster and brings it under the display driver's time limit.
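A toy illustration of the effect (a hypothetical kernel, not the code from the question): comment out the single store below and the loop feeding it becomes dead, so the compiler deletes it and the kernel runs in almost no time:

__global__ void busy(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)   // expensive loop
        acc += in[i] * in[i];
    // With this store commented out, acc is never used, so nvcc's dead
    // code elimination removes the loop above as well.
    out[threadIdx.x] = acc;
}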
For workarounds, see this recent Stack Overflow question and answer.
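One common workaround (a sketch under assumed names, not necessarily what the linked answer proposes) is to split the work into smaller kernel launches so each one finishes well under the watchdog limit:

#include <algorithm>
#include <cuda_runtime.h>

__global__ void processChunk(const int *data, int offset, int n);  // hypothetical kernel

void runInChunks(const int *d_data, int total)
{
    const int threads = 256;
    const int chunk = 1 << 20;  // slice size; tune so each launch stays well under ~2 s
    for (int offset = 0; offset < total; offset += chunk) {
        int n = std::min(chunk, total - offset);
        int blocks = (n + threads - 1) / threads;
        processChunk<<<blocks, threads>>>(d_data, offset, n);
        cudaDeviceSynchronize();  // let the display update between slices
    }
}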