OpenCL kernels and traditional loops
I'm studying OpenCL and I don't understand the relationship between a traditional loop in C/C++ code and kernel code.
Just to be clear, I mean a situation where a loop over an array in C/C++ is rewritten as a kernel.
So my question is: in the traditional loop I have the variable `n` as my boundary, while in the kernel code I don't have it, but I have `get_global_id(0)`, which indicates the memory scope of my array. Does this mean that I start from 0 and iterate until `get_global_id` matches the maximum size of the array, `n` in this case? Or is it something different?
I ask because for this other example, a loop that computes `A[i] = B[i-1]+B[i]+B[i+1]`, I don't know how to write the corresponding kernel code.
I hope my question is clear; my English is not very good, sorry.
Thanks in advance for the help. If there are problems, let me know!
Comments (1)
An OpenCL kernel is coded like a single iteration of a for-loop, but all iterations run in parallel and in arbitrary order.
Consider vector addition in C++: for `i = 0..N-1`, a for-loop adds the elements of the vectors one after the other. In OpenCL, the vector addition looks like the inside of this for-loop, but written as a function with the `kernel` keyword and all vectors as parameters.
You might be wondering: where is `N`? You give `N` to the kernel on the C++ side as its "global range", so the kernel knows how many elements `i` to calculate in parallel.
Because every iteration in the OpenCL kernel runs in parallel, there must not be any data dependencies from one iteration to the next; otherwise you have to use a double buffer (only read from one buffer and only write to the other). In your second example with `A[i] = B[i-1]+B[i]+B[i+1]` you do exactly that: only read from `B`, only write to `A`. The implementation with periodic boundaries can be done branch-less.
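
To make the loop-versus-kernel correspondence concrete, here is a minimal C++ sketch that emulates it on the CPU. It is not the original code from this thread: the names `add_loop`, `add_work_item`, `stencil_work_item`, `launch`, `add_kernel_style` and `stencil_kernel_style` are made up for illustration, and `launch` is only a stand-in for what the OpenCL runtime does when a kernel is enqueued with a global range of `N`. The modulo-based index wrap is one possible branch-less way to get periodic boundaries, assumed here and not necessarily the one the answer had in mind. The OpenCL C form of the addition kernel is shown in a comment.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Traditional version: the loop owns both the index i and the bound N.
std::vector<float> add_loop(const std::vector<float>& A, const std::vector<float>& B) {
    const int N = static_cast<int>(A.size());
    std::vector<float> C(N);
    for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
    return C;
}

// Kernel-style version: only the loop body remains and i is handed in.
// The OpenCL C counterpart would look roughly like this:
//   kernel void add_kernel(const global float* A, const global float* B, global float* C) {
//       const int i = get_global_id(0); // index of this work-item, 0..N-1
//       C[i] = A[i] + B[i];
//   }
void add_work_item(int i, const float* A, const float* B, float* C) {
    C[i] = A[i] + B[i];
}

// Second example: A[i] = B[i-1] + B[i] + B[i+1] with periodic boundaries.
// The modulo wraps the neighbor indices, so no if-statement is needed.
// It only reads B and only writes A, so the iterations are independent.
void stencil_work_item(int i, int N, const float* B, float* A) {
    const int left = (i + N - 1) % N;
    const int right = (i + 1) % N;
    A[i] = B[left] + B[i] + B[right];
}

// Hypothetical stand-in for the OpenCL runtime (not an OpenCL API): calls the
// work-item once for every id in 0..global_range-1, in shuffled order, to
// mimic that a kernel must not rely on any ordering between iterations.
template <typename WorkItem>
void launch(int global_range, WorkItem work_item) {
    assert(global_range >= 0);
    std::vector<int> ids(global_range);
    std::iota(ids.begin(), ids.end(), 0);
    std::mt19937 rng(12345); // fixed seed so the shuffle is reproducible
    std::shuffle(ids.begin(), ids.end(), rng);
    for (int i : ids) work_item(i);
}

// "Host side": N is passed as the global range of the launch.
std::vector<float> add_kernel_style(const std::vector<float>& A, const std::vector<float>& B) {
    const int N = static_cast<int>(A.size());
    std::vector<float> C(N);
    launch(N, [&](int i) { add_work_item(i, A.data(), B.data(), C.data()); });
    return C;
}

std::vector<float> stencil_kernel_style(const std::vector<float>& B) {
    const int N = static_cast<int>(B.size());
    std::vector<float> A(N);
    launch(N, [&](int i) { stencil_work_item(i, N, B.data(), A.data()); });
    return A;
}
```

Calling `add_kernel_style(A, B)` from a `main` gives the same vector as `add_loop(A, B)` even though the ids are visited in shuffled order, which is the property the answer describes. With real OpenCL the role of `launch` is played by the runtime: the host passes `N` as the global work size when enqueuing the kernel (for example via `clEnqueueNDRangeKernel`), and `get_global_id(0)` returns a different `i` in each work-item.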