OpenCL kernel optimization



I work in a kernel with a large array of unsigned chars. I create a memory object with clCreateBuffer, then copy a chunk of unsigned chars into it with clEnqueueWriteBuffer. Then, in a loop, I call a kernel that reads from this memory object, does some logic, and writes new data back to the same place (I don't call clEnqueueWriteBuffer or clEnqueueReadBuffer inside this loop). Here is the kernel code:

__kernel void test(__global unsigned char *in, unsigned int offset) {
    int grId = get_group_id(0);
    unsigned char msg[1024];
    offset *= grId;

    // Copy from global to private memory
    size_t i;
    for (i = 0; i < 1024; i++)
        msg[i] = in[ offset + i ];

    // Make some computation here, not complicated logic    

    // Copy from private to global memory
    for (i = 0; i < 1024; i++)
        in[ offset + i ] = msg[i];
}

When the loop is done (it runs roughly 1000 times), I read the result from the memory object with clEnqueueReadBuffer.
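
For reference, a minimal host-side sketch of the setup described above, assuming the kernel shown earlier; ctx, queue, kernel and host_data are assumed to exist already, NUM_GROUPS and LOCAL_SIZE are placeholder constants, and error checking is omitted:

#include <CL/cl.h>

/* ctx, queue, kernel, host_data assumed to be set up elsewhere;
   NUM_GROUPS and LOCAL_SIZE are placeholders, not from the original post. */
size_t chunk = 1024;                        /* bytes handled per work-group */
size_t total = NUM_GROUPS * chunk;          /* whole buffer size            */

cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, total, NULL, NULL);
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, total, host_data, 0, NULL, NULL);

cl_uint offset = (cl_uint)chunk;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clSetKernelArg(kernel, 1, sizeof(cl_uint), &offset);

size_t global = NUM_GROUPS * LOCAL_SIZE;    /* one group per 1024-byte chunk */
size_t local  = LOCAL_SIZE;

for (int i = 0; i < 1000; i++)              /* no read/write inside the loop */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);

clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, total, host_data, 0, NULL, NULL);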

Is it possible to optimize this code?


Comments (4)

剪不断理还乱 2024-11-12 00:43:31


Some suggestions:

  • do a single in += get_group_id(0) * offset at the beginning of the kernel.
  • read 4 chars at a time (work on uchar4 or uint); see the sketch after this list.
  • if possible, process 4 chars at a time too.
  • with a 1K private array in each thread, workgroup size and occupancy will be severely limited; it may be more efficient to run more threads, each processing fewer chars.
  • it seems that all threads in each group will process exactly the same data; that may not be what you had in mind.
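
A minimal sketch of the first two points (the per-group pointer offset and 4-byte reads), assuming the same 1024-byte chunk per group as in the question; the computation is left as a placeholder, and vload4/vstore4 are used so no extra alignment is required:

__kernel void test_vec(__global uchar *in, unsigned int offset) {
    // Apply the per-group offset once, up front.
    in += get_group_id(0) * offset;

    uchar4 msg[256];                  // the same 1024 bytes as 256 x uchar4

    // Copy from global to private memory, 4 bytes per load.
    for (int i = 0; i < 256; i++)
        msg[i] = vload4(i, in);

    // ... computation here, ideally on 4 chars at a time ...

    // Copy from private to global memory, 4 bytes per store.
    for (int i = 0; i < 256; i++)
        vstore4(msg[i], i, in);
}
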
仅一夜美梦 2024-11-12 00:43:31


You could try the vector version (uchar8 instead of uchar), but the compiler may optimize it this way anyway.
Most importantly, profile your code all the time and experiment.

edit

It seems even uchar16 is supported now:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/vectorDataTypes.html
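
For illustration only, the two copy loops of the question's kernel could be rewritten with uchar16 via vload16/vstore16 (assuming in and offset are the same arguments as in the original kernel; whether this actually helps depends on the device and compiler, so profile):

// Drop-in replacement for the two copy loops of the original kernel:
// the same 1024 bytes viewed as 64 x uchar16.
uchar16 msg[64];

for (int i = 0; i < 64; i++)                // global -> private
    msg[i] = vload16(i, in + offset);

// ... computation ...

for (int i = 0; i < 64; i++)                // private -> global
    vstore16(msg[i], i, in + offset);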

一江春梦 2024-11-12 00:43:31


The first thing that comes to mind is that unrolling your loops can help you skip the condition evaluation. You can use this pragma to make it easier.

Using shared memory on Nvidia chips could also help greatly (if your current local memory does not map to shared memory by default).
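
The link behind "this pragma" did not survive; presumably the hint meant is #pragma unroll (an assumption — it is supported by NVIDIA's OpenCL compiler and several others, and OpenCL 2.0 adds a standard attribute). Applied to the copy loop from the question it would look roughly like this:

// Ask the compiler to fully unroll the 1024-iteration copy loop.
// Vendor support varies; OpenCL 2.0 offers
// __attribute__((opencl_unroll_hint(N))) as a standard alternative.
#pragma unroll
for (int i = 0; i < 1024; i++)
    msg[i] = in[offset + i];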

清欢 2024-11-12 00:43:31


For optimization advice you need to explain what kind of calculations you do. The biggest performance gains come from grouping your calculations into work groups and letting them work on local memory. You need to pay close attention to the size of your private memory (smallest) and local memory (small).

How often is your kernel called? Do all kernels use the same data? One could think of a local memory buffer where all threads in a work group read a part of the data into local memory and then share it. You need to pay a little attention to synchronization.

I suggest having a look at the samples from your SDK vendor. I only know the nVidia SDK. The samples there are quite complex, but very interesting to read.

The change to vector types like float4 should be suitable for ATI boards. nVidia is said to work best with scalars and internal compiler optimization. This is something for fine-tuning later with a profiler. You can gain orders of magnitude in performance through memory optimization.
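
A minimal sketch of the cooperative local-memory pattern suggested in the second paragraph, assuming the 1024-byte chunks from the question and an otherwise unspecified work-group size (all names are illustrative):

__kernel void test_local(__global uchar *in, unsigned int offset) {
    __local uchar tile[1024];

    int lid   = get_local_id(0);
    int lsize = get_local_size(0);          // work-group size, e.g. 256
    in += get_group_id(0) * offset;

    // Each work-item loads a strided share of the 1024 bytes.
    for (int i = lid; i < 1024; i += lsize)
        tile[i] = in[i];
    barrier(CLK_LOCAL_MEM_FENCE);           // tile is now fully loaded

    // ... computation: work-items cooperate on the shared tile ...

    barrier(CLK_LOCAL_MEM_FENCE);           // wait before writing back
    for (int i = lid; i < 1024; i += lsize)
        in[i] = tile[i];
}

The two barriers ensure every work-item sees the fully loaded tile before computing and the finished results before writing back.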
