CUDA algorithm structure
I would like to understand the general way of doing the following on a GPU using CUDA.
I have an algorithm that might look something like this:
void DoStuff(int[,] inputMatrix, int[,] outputMatrix)
{
    for (...) {
        for (...) {
            if (something) {
                DoStuffA(inputMatrix, a, b, c, outputMatrix);
            }
            else {
                DoStuffB(inputMatrix, a, b, c, outputMatrix);
            }
        }
    }
}
DoStuffA and DoStuffB are simple parallelizable functions (e.g. doing a matrix row operation) that the CUDA examples have plenty of.
What I want to know is how to put the main algorithm "DoStuff" onto the GPU and then call DoStuffA and DoStuffB as and when I need to (and have them execute in parallel). I.e. the outer loop part is single-threaded, but the inner calls are not.
The examples I have seen seem to be multithreaded from the get-go. I assume there is a way to just call a single GPU-based method from the outside world and have it control all of the parallel bits by itself?
1 Answer
It depends on how the data in the for loops inter-relates, but roughly I would keep the outer loops in ordinary single-threaded host code and launch DoStuffA or DoStuffB as a CUDA kernel each time one is needed. This way, the biggest problem is the overhead of launching each kernel. If your input data is large, then it won't be so bad.
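To make that concrete, here is a minimal sketch of the host-driven version. Everything in it is illustrative rather than taken from the question: the doStuffA/doStuffB kernels stand in for the real row operations, the matrices are assumed to be flattened row-major int arrays already resident on the device, and something() is a stand-in for the question's branch condition.

#include <cuda_runtime.h>

// Placeholder row operations standing in for DoStuffA/DoStuffB.
// Each launch processes one row of a rows x cols matrix in parallel,
// one thread per column.
__global__ void doStuffA(const int* in, int* out, int a, int b, int c, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols)
        out[a * cols + col] = in[a * cols + col] + c;   // illustrative row op
}

__global__ void doStuffB(const int* in, int* out, int a, int b, int c, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols)
        out[b * cols + col] = in[b * cols + col] * c;   // illustrative row op
}

// Stand-in for the question's "something" condition.
static bool something(int a, int b) { return (a + b) % 2 == 0; }

// The outer loops stay single-threaded on the CPU; each inner call
// becomes a kernel launch. d_in and d_out are device pointers to
// row-major rows x cols matrices.
void DoStuff(const int* d_in, int* d_out, int rows, int cols)
{
    int threads = 256;
    int blocks = (cols + threads - 1) / threads;
    for (int a = 0; a < rows; ++a) {
        for (int b = 0; b < rows; ++b) {
            if (something(a, b))
                doStuffA<<<blocks, threads>>>(d_in, d_out, a, b, 1, cols);
            else
                doStuffB<<<blocks, threads>>>(d_in, d_out, a, b, 1, cols);
        }
    }
    cudaDeviceSynchronize();   // wait for all queued kernels to finish
}

Kernels launched on the same stream execute in order, so any dependency between one iteration's output and the next iteration's input is preserved without extra synchronization inside the loop. The cost is the per-launch overhead mentioned above, which amortizes well when each row is large.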