CUDA algorithm structure

Posted on 2024-11-30 09:05:51

I would like to understand the general way of doing the following on a GPU using CUDA.

I have an algorithm that might look something like this:

void DoStuff(int[,] inputMatrix, int[,] outputMatrix)
{
    forloop {
        forloop {
            if (something) {
                DoStuffA(inputMatrix, a, b, c, outputMatrix);
            }
            else {
                DoStuffB(inputMatrix, a, b, c, outputMatrix);
            }
        }
    }
}

DoStuffA and DoStuffB are simple parallelizable functions (e.g. a matrix row operation) of the kind the CUDA examples have plenty of.

What I want to know is how to put the main algorithm "DoStuff" onto the GPU and have it call DoStuffA and DoStuffB as and when needed, with each of those calls executing in parallel. That is, the outer loop is single-threaded, but the inner calls are not.

The examples I have seen seem to be multithreaded from the get-go. I assume there is a way to call a single GPU-based method from the outside world and have it control all of the parallel bits by itself?
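
To make the structure I am describing concrete, here is a minimal sketch of one common arrangement: the single-threaded outer loops stay in ordinary host code, and only DoStuffA and DoStuffB become kernels. Everything specific in it is an assumption for illustration, not part of my real algorithm: the row add/subtract bodies of the two kernels, the `(src + dst) % 2 == 0` test standing in for "something", the row-major flat `int` layout, and the lowercase names `doStuffA`, `doStuffB` and `doStuff`. Error checking is omitted to keep it short.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for DoStuffA: one thread per column adds row srcRow of
// the input into row dstRow of the output.
__global__ void doStuffA(const int* in, int* out, int cols, int srcRow, int dstRow) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols) out[dstRow * cols + col] += in[srcRow * cols + col];
}

// Hypothetical stand-in for DoStuffB: same shape, but subtracts instead.
__global__ void doStuffB(const int* in, int* out, int cols, int srcRow, int dstRow) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols) out[dstRow * cols + col] -= in[srcRow * cols + col];
}

// Host-side DoStuff: the loops run on a single CPU thread; each iteration
// launches one parallel kernel. The matrices stay on the GPU throughout.
std::vector<int> doStuff(const std::vector<int>& input, int rows, int cols) {
    assert(input.size() == static_cast<std::size_t>(rows) * cols);
    const std::size_t bytes = input.size() * sizeof(int);

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);  // output starts as all zeros

    const int threads = 128;
    const int blocks = (cols + threads - 1) / threads;

    for (int src = 0; src < rows; ++src) {
        for (int dst = 0; dst < rows; ++dst) {
            // Placeholder for "something"; the real condition is not shown here.
            if ((src + dst) % 2 == 0) {
                doStuffA<<<blocks, threads>>>(dIn, dOut, cols, src, dst);
            } else {
                doStuffB<<<blocks, threads>>>(dIn, dOut, cols, src, dst);
            }
        }
    }

    // Kernels launched into the same stream run in launch order, and this
    // blocking copy waits for all of them before reading the result.
    std::vector<int> output(input.size());
    cudaMemcpy(output.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return output;
}
```

Calling `doStuff` with a 2 x 3 input of {1, 2, 3, 4, 5, 6} should give {-3, -3, -3, 3, 3, 3} under these placeholder kernels. As far as I know, the alternative of keeping the control loop on the device is dynamic parallelism, where a kernel launches other kernels; it needs a device of compute capability 3.5 or higher and relocatable device code (`nvcc -rdc=true`).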
