Basic array copy OpenCL GPU
Can someone help me turn this C code for the CPU into kernel code for the GPU?
int a[N], b[N];
b[0] = a[0];
b[N-1] = a[N-1];   // the last valid index is N-1, not N
for (int i = 1; i < N-1; i++)
    b[i] = a[i-1] + a[i] + a[i+1];
I thought of writing it like this, but I would like to find a solution with better performance:
__kernel void adjacentCopy(__global double *a, __global double *b, const unsigned int n) {
    int gid = get_global_id(0);
    // skip gid == 0 and gid == n-1 so that a[gid-1] and a[gid+1] stay in bounds
    if (gid > 0 && gid < (int)n - 1)
        b[gid] = a[gid-1] + a[gid] + a[gid+1];
}
// and then handle the two base cases on the host
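I am also wondering whether I could use local memory, and bring the two extreme cases back into the kernel without adding branch divergence.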
A kernel is in essence a for-loop in which every iteration runs in parallel. The exact order of execution is not defined, so there must not be any data dependencies from one iteration to the next; otherwise you have to use a double buffer (only read from one buffer and only write to the other).

In your case, the kernel would read:
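A minimal sketch of such a kernel, assuming the two boundary elements are still written by the host as in the question. The guard clause, the const qualifier on a, and everything inside the #ifndef blocks are assumptions added for illustration: the shim only lets the same source compile as plain C and run one work-item at a time for checking, and an OpenCL compiler (which defines __OPENCL_VERSION__) skips it. run_adjacentCopy is a hypothetical stand-in for clEnqueueNDRangeKernel. Using double on the device requires fp64 support.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (assumption, for checking on the CPU); an OpenCL compiler skips it. */
#include <stddef.h>
#define __kernel
#define __global
static size_t emu_gid;                     /* index of the emulated work-item */
static size_t get_global_id(unsigned int dim) { (void)dim; return emu_gid; }
#endif

__kernel void adjacentCopy(const __global double *a, __global double *b, const unsigned int n) {
    const unsigned int gid = (unsigned int)get_global_id(0);
    /* Guard clause: skip both boundary elements and any padding work-items. */
    if (gid == 0 || gid + 1 >= n) return;
    /* Double buffering: read only from a, write only to b. */
    b[gid] = a[gid - 1] + a[gid] + a[gid + 1];
}

#ifndef __OPENCL_VERSION__
/* Hypothetical serial stand-in for clEnqueueNDRangeKernel: one call per work-item. */
void run_adjacentCopy(const double *a, double *b, unsigned int n, size_t global_size) {
    for (emu_gid = 0; emu_gid < global_size; ++emu_gid)
        adjacentCopy(a, b, n);
}
#endif
```

The test gid + 1 >= n also turns away extra work-items when the global size is rounded up beyond n, and it stays safe for n == 0.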
As for the extreme cases gid==0||gid==N-1: on such a computational grid you typically use periodic boundary conditions. Then the kernel becomes branchless and looks like this:
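A sketch of the periodic variant, under the assumptions that the kernel is launched with exactly n work-items (there is no guard for padding) and that n >= 1. Note that periodic boundaries change the result at the two ends compared with the C code in the question: b[0] becomes a[n-1]+a[0]+a[1] instead of a[0]. The kernel name adjacentCopyPeriodic and the host-only shim inside the #ifndef blocks are hypothetical and exist only so the source can be checked as plain C.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (assumption, for checking on the CPU); an OpenCL compiler skips it. */
#include <stddef.h>
#define __kernel
#define __global
static size_t emu_gid;                     /* index of the emulated work-item */
static size_t get_global_id(unsigned int dim) { (void)dim; return emu_gid; }
#endif

__kernel void adjacentCopyPeriodic(const __global double *a, __global double *b, const unsigned int n) {
    const unsigned int gid = (unsigned int)get_global_id(0);
    /* Wrap the neighbor indices around the ends; adding n before subtracting 1
       keeps the unsigned index from underflowing at gid == 0. */
    b[gid] = a[(gid + n - 1) % n] + a[gid] + a[(gid + 1) % n];
}

#ifndef __OPENCL_VERSION__
/* Hypothetical serial stand-in for a launch with global size n. */
void run_adjacentCopyPeriodic(const double *a, double *b, unsigned int n) {
    for (emu_gid = 0; emu_gid < n; ++emu_gid)
        adjacentCopyPeriodic(a, b, n);
}
#endif
```

Every work-item executes the same instructions, so there is nothing to diverge on; the price is two integer modulo operations per element.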
Local memory optimization: without it, every thread reads 3 neighboring values of a from slow global memory. In theory, you could load only one element per thread from global memory and use fast local memory to share the data within the workgroup. But the two threads at gid==0||gid==N-1 will have to load 2 values from global memory, introducing branching, and this will likely kill any potential gain in performance. The added complication, together with no significant gain in performance, makes the local memory optimization an unfavourable choice in this case. This is what the kernel would look like:
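One possible shape of a local-memory version, offered as a sketch rather than a tuned implementation. It assumes the local work size equals WG_SIZE (a hypothetical constant, set to 4 here only to keep the check small; 64 to 256 is more typical), that the global size is a multiple of it, and that the host still writes the two boundary elements. With more than one work-group, it is the first and last work-item of every group that need an extra neighbor, not only gid==0 and gid==N-1. The host-only shim emulates barrier() by running each group in two passes, which works only because the loads before the barrier are idempotent.

```c
#define WG_SIZE 4u   /* hypothetical; must match the local work size used at launch */

#ifndef __OPENCL_VERSION__
/* Host-only shim (assumption, for checking on the CPU); an OpenCL compiler skips it. */
#include <stddef.h>
#define __kernel
#define __global
#define __local static                     /* tile persists across emulated work-items */
#define CLK_LOCAL_MEM_FENCE 0
static size_t emu_gid, emu_lid;            /* emulated global and local indices */
static int emu_pass;                       /* 0: stop at the barrier, 1: run past it */
static size_t get_global_id(unsigned int dim) { (void)dim; return emu_gid; }
static size_t get_local_id(unsigned int dim) { (void)dim; return emu_lid; }
#define barrier(flags) do { if (emu_pass == 0) return; } while (0)
#endif

__kernel void adjacentCopyLocal(const __global double *a, __global double *b, const unsigned int n) {
    const unsigned int gid = (unsigned int)get_global_id(0);
    const unsigned int lid = (unsigned int)get_local_id(0);
    __local double tile[WG_SIZE + 2];      /* the group's slice of a plus one halo cell per side */

    if (gid < n) tile[lid + 1] = a[gid];   /* every work-item loads its own element */
    /* The two edge work-items of the group also load the neighbor outside the slice. */
    if (lid == 0 && gid > 0 && gid < n) tile[0] = a[gid - 1];
    if (lid == WG_SIZE - 1 && gid + 1 < n) tile[WG_SIZE + 1] = a[gid + 1];
    barrier(CLK_LOCAL_MEM_FENCE);          /* all work-items must reach this, so no return above */

    if (gid == 0 || gid + 1 >= n) return;  /* boundaries and padding: left to the host */
    b[gid] = tile[lid] + tile[lid + 1] + tile[lid + 2];
}

#ifndef __OPENCL_VERSION__
/* Hypothetical serial stand-in for a launch with local size WG_SIZE. */
void run_adjacentCopyLocal(const double *a, double *b, unsigned int n, size_t global_size) {
    for (size_t group = 0; group < global_size; group += WG_SIZE)
        for (emu_pass = 0; emu_pass < 2; ++emu_pass)
            for (emu_lid = 0; emu_lid < WG_SIZE; ++emu_lid) {
                emu_gid = group + emu_lid;
                adjacentCopyLocal(a, b, n);
            }
}
#endif
```

The extra branches for the halo loads are exactly the complication the answer warns about, so it is worth measuring against the plain version before adopting this.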