CUDA似乎达到了极限,但那是什么极限呢?

发布于 2024-11-27 12:21:17 字数 3322 浏览 1 评论 0原文

我有一个 CUDA 程序,似乎达到了某些资源的某种限制,但我无法弄清楚该资源是什么。这是内核函数:

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, 
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // if not a different segment within out polyline and
            // not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex && 
                currentPolylineIndex >= 0);      

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            result = result | (isLegit && !isParallel && 
                intersectionPoint.x > min(p1.x, p2.x) && 
                intersectionPoint.x > min(p3.x, p4.x) && 
                intersectionPoint.x < max(p1.x, p2.x) && 
                intersectionPoint.x < max(p3.x, p4.x) && 
                intersectionPoint.y > min(p1.y, p2.y) && 
                intersectionPoint.y > min(p3.y, p4.y) && 
                intersectionPoint.y < max(p1.y, p2.y) && 
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}

这是执行内核函数的调用:

DoCheck<<<702, 32>>>(
    (float2*)devicePoints, 
    deviceSegmentsToPolylineIndexMap, 
    numSegments, 
    deviceOutput);

参数的大小如下:

  • devicePoints = 22,464 float2s = 179,712 字节
  • deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 字节
  • numSegments = 1 int = 4 字节
  • deviceOutput = 22,463 ints = 89,852 字节

当我执行这个内核,它使显卡崩溃。看来我遇到了某种限制,因为如果我使用 DoCheck<<<300, 32>>>(...); 执行内核,它就会起作用。需要明确的是,参数是相同的,只是块的数量不同。

知道为什么一个会使视频驱动程序崩溃,而另一个却不会?失败的似乎仍在该卡的块数限制之内。

更新 有关我的系统配置的更多信息:

  • 显卡:nVidia 8800GT
  • CUDA 版本:1.1
  • 操作系统:Windows Server 2008 R2

我还在具有以下配置的笔记本电脑上进行了尝试,但得到了相同的结果:

  • 显卡:nVidia Quadro FX 880M
  • CUDA 版本: 1.2
  • 操作系统:Windows 7 64位

I have a CUDA program that seems to be hitting some sort of limit of some resource, but I can't figure out what that resource is. Here is the kernel function:

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, 
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // if not a different segment within out polyline and
            // not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex && 
                currentPolylineIndex >= 0);      

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            result = result | (isLegit && !isParallel && 
                intersectionPoint.x > min(p1.x, p2.x) && 
                intersectionPoint.x > min(p3.x, p4.x) && 
                intersectionPoint.x < max(p1.x, p2.x) && 
                intersectionPoint.x < max(p3.x, p4.x) && 
                intersectionPoint.y > min(p1.y, p2.y) && 
                intersectionPoint.y > min(p3.y, p4.y) && 
                intersectionPoint.y < max(p1.y, p2.y) && 
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}

Here is the call to execute the kernel function:

DoCheck<<<702, 32>>>(
    (float2*)devicePoints, 
    deviceSegmentsToPolylineIndexMap, 
    numSegments, 
    deviceOutput);

The sizes of the parameters are as follows:

  • devicePoints = 22,464 float2s = 179,712 bytes
  • deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
  • numSegments = 1 int = 4 bytes
  • deviceOutput = 22,463 ints = 89,852 bytes

When I execute this kernel, it crashes the video card. It would appear that I am hitting some sort of limit, because if I execute the kernel using DoCheck<<<300, 32>>>(...);, it works. Just to be clear, the parameters are the same, just the number of blocks is different.

Any idea why one crashes the video driver, and the other doesn't? The one that fail seems to be still within the card's limit on number of blocks.

Update
More information on my system configuration:

  • Video Card: nVidia 8800GT
  • CUDA Version: 1.1
  • OS: Windows Server 2008 R2

I also tried it on a laptop with the following configuration, but got the same results:

  • Video Card: nVidia Quadro FX 880M
  • CUDA Version: 1.2
  • OS: Windows 7 64-bit

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

泅渡 2024-12-04 12:21:17

正在耗尽的资源是时间。在所有当前的 CUDA 平台上,显示驱动程序都包含一个看门狗定时器,它将杀死任何需要超过几秒钟执行的内核。在运行显示器的卡上运行代码受此限制。

在您使用的 WDDM Windows 平台上,存在三种可能的解决方案/解决方法:

  1. 获取 Telsa 卡并使用 TCC 驱动程序,这可以完全消除问题
  2. 尝试修改注册表设置以增加计时器限制(google 查找 TdrDelay 注册表项了解更多信息,但我不是 Windows 用户,不能比这更具体)
  3. 将内核代码修改为“可重入”,并在多个内核启动而不是一次启动中处理数据并行工作负载。内核启动开销并不是那么大,并且在多个内核运行中处理工作负载通常很容易实现,具体取决于您使用的算法。

The resource which is being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer which will kill any kernel which takes more than a few seconds to execute. Running code on a card which is running a display is subject to this limit.

On the WDDM Windows platforms you are using, there are three possible solutions/work-arounds:

  1. Get a Telsa card and use the TCC driver, which eliminates the problem completely
  2. Try modifying registry settings to increase the timer limit (google for the TdrDelay registry key for more information, but I am not a Windows user and can't be more specific than that)
  3. Modify your kernel code to be "re-entrant" and process the data parallel work load in several kernel launches rather than one. Kernel launch overhead isn't all that large and processing the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文