Basic array copy, OpenCL GPU

Posted 2025-02-10 18:22:58

Can someone help me figure out how to translate this C code for the CPU into kernel code for the GPU?

int a[N], b[N];
b[0] = a[0];
b[N-1] = a[N-1];

for (int i = 1; i < N-1; i++)
    b[i] = a[i-1] + a[i] + a[i+1];

I thought about writing it this way, but I would like to find a better-performing solution:

__kernel void adjacentCopy(__global double *a, __global double *b, const unsigned int n) {

    int gid = get_global_id(0);

    if (gid > 0 && gid < n-1)
        b[gid] = a[gid-1] + a[gid] + a[gid+1];

}
    // and then handle the two boundary cases on the host

Can anyone suggest a way to organize the code to use local memory, and also to bring the two boundary cases back into the kernel, without adding branch divergence?


世态炎凉 2025-02-17 18:22:58


A kernel is in essence a for-loop of which every iteration runs in parallel. The exact order of execution is undefined, so there must not be any data dependencies from one iteration to the next; otherwise you have to use double buffering (only read from one buffer and only write to the other).

In your case, the kernel would read:

__kernel void adjacentCopy(const __global double *a, __global double *b, const unsigned int N) {
    int gid = get_global_id(0);
    if(gid==0||gid==N-1) return; // guard clause: do not execute the first and last element
    b[gid] = a[gid-1]+a[gid]+a[gid+1]; // double buffers to resolve data dependencies: only read from a and only write to b
}

For the extreme cases gid==0||gid==N-1, on such a computational grid you typically use periodic boundary conditions. The kernel then becomes branchless and looks like this:

__kernel void adjacentCopy(const __global double *a, __global double *b, const unsigned int N) {
    int gid = get_global_id(0);
    b[gid] = a[(gid+N-1)%N]+a[gid]+a[(gid+1)%N]; // periodic boundaries with modulo; in "(gid+N-1)" the "+N" ensures that the argument of the modulo operator always is positive
}

Now for the local memory optimization: without it, every thread reads 3 neighboring values of a from slow global memory. In theory, you could load only one element per thread from global memory and use fast local memory to share the data within the workgroup. But the first and last thread in each workgroup have to load a second value (the halo element) from global memory, introducing branching, and this will likely kill any potential gain in performance. The added complication, together with no significant gain in performance, makes the local memory optimization an unfavourable choice in this case. This is what the kernel would look like:

#define def_workgroup_size 128 // set this to the size of the workgroup
__kernel void adjacentCopy(const __global double *a, __global double *b, const unsigned int N) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    __local double cached_a[def_workgroup_size+2]; // as large as the workgroup, plus the neighbors on the left and right sides of the workgroup
    if(lid==0) cached_a[0] = a[(gid+N-1)%N]; // first thread in the workgroup also loads the left neighbor
    cached_a[lid+1] = a[gid];
    if(lid==def_workgroup_size-1) cached_a[lid+2] = a[(gid+1)%N]; // last thread in the workgroup also loads the right neighbor
    barrier(CLK_LOCAL_MEM_FENCE); // barrier to make sure cached_a is entirely filled
    b[gid] = cached_a[lid]+cached_a[lid+1]+cached_a[lid+2]; // read the 3 values from local memory
}