OpenCL kernel optimization
I work in a kernel with a large array of unsigned chars, and I create the memory object with clCreateBuffer. Then I copy a chunk of unsigned chars into this memory object through clEnqueueWriteBuffer. Then, in a loop, I call the kernel, which reads from this memory object, does some logic, and writes new data back to the same place (I don't call clEnqueueWriteBuffer or clEnqueueReadBuffer inside this loop). Here is the kernel code:
__kernel void test(__global unsigned char *in, unsigned int offset) {
    int grId = get_group_id(0);
    unsigned char msg[1024];
    offset *= grId;

    // Copy from global to private memory
    size_t i;
    for (i = 0; i < 1024; i++)
        msg[i] = in[offset + i];

    // Make some computation here, not complicated logic

    // Copy from private to global memory
    for (i = 0; i < 1024; i++)
        in[offset + i] = msg[i];
}
When the loop is done (it runs roughly 1000 times), I read the result from the memory object through clEnqueueReadBuffer.
Is it possible to optimize this code?
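For reference, the host-side flow described above corresponds roughly to the sketch below (C with the OpenCL host API). The buffer size, the one-work-item-per-group launch, and the names context, queue, kernel and host_data are assumptions, and error handling is omitted.

// Hypothetical layout: 256 work-groups x 1024 bytes per group.
size_t chunk  = 1024;
size_t groups = 256;
size_t total  = chunk * groups;

cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, total, NULL, &err);

// One upload before the loop...
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, total, host_data, 0, NULL, NULL);

cl_uint offset = (cl_uint)chunk;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clSetKernelArg(kernel, 1, sizeof(cl_uint), &offset);

// One work-item per work-group, matching how the kernel indexes by group id.
size_t global = groups, local = 1;
for (int iter = 0; iter < 1000; ++iter)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

// ...and one download after the loop finishes.
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, total, host_data, 0, NULL, NULL);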
Some suggestions:
Add

    in += get_group_id(0) * offset

at the beginning of the kernel, so the offset is applied once instead of on every access.
You could try the vector version (uchar8 instead of uchar), but the compiler may optimize it this way anyway.
Most important: profile your code all the time and experiment.
edit
Seems even uchar16 is supported now:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/vectorDataTypes.html
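A sketch of how the kernel might look with both suggestions applied; treating the 1024-byte chunk as 64 uchar16 loads and the placeholder computation are assumptions, not the asker's actual logic.

__kernel void test(__global unsigned char *in, unsigned int offset) {
    // Apply the per-group offset once, as suggested above.
    in += get_group_id(0) * offset;

    // 1024 bytes = 64 x uchar16; vload16/vstore16 work on any uchar pointer.
    uchar16 msg[64];

    for (int i = 0; i < 64; i++)
        msg[i] = vload16(i, in);

    // Make some computation here on the uchar16 elements

    for (int i = 0; i < 64; i++)
        vstore16(msg[i], i, in);
}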
Something that comes to mind first is that unrolling your loops can help you skip the condition evaluation. You can use an unroll pragma to make it easier (see the sketch below).
Using shared memory on NVIDIA chips could also help greatly (if your current local memory is not backed by shared memory by default).
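For example, the copy loops from the question could carry an unroll hint like this; #pragma unroll (with an assumed factor of 16) is accepted by the NVIDIA OpenCL compiler, and other compilers may ignore it or spell the hint differently.

__kernel void test(__global unsigned char *in, unsigned int offset) {
    unsigned char msg[1024];
    offset *= get_group_id(0);

    // Hint to the compiler to unroll the copy loop; factor 16 is an example.
    #pragma unroll 16
    for (int i = 0; i < 1024; i++)
        msg[i] = in[offset + i];

    // Make some computation here, not complicated logic

    #pragma unroll 16
    for (int i = 0; i < 1024; i++)
        in[offset + i] = msg[i];
}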
For optimization you need to explain what kind of calculations you do. The biggest performance gain usually comes from grouping your calculations into work-groups and letting them work on local memory. You need to pay a lot of attention to the size of your private memory (the smallest) and local memory (small).
How often is your kernel called? Do all kernels use the same data? One could think of a local-memory buffer where all threads in a work-group read a part of the data into local memory and share it afterwards; see the sketch below. You need to pay a little attention to synchronization.
I suggest having a look at the samples from your SDK vendor. I only know the nVidia SDK; the samples there are quite complex, but very interesting to read.
The change to vector types like float4 should suit ATI boards. nVidia is said to work best with scalars and internal compiler optimization. This is something for fine-tuning later with a profiler. You can gain orders of magnitude in performance through memory optimization.
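A minimal sketch of the local-memory idea, assuming a 1024-byte chunk per work-group and leaving the actual per-element computation as a placeholder; the work-group size is whatever the host chooses at launch.

__kernel void test_local(__global unsigned char *in, unsigned int offset) {
    __local unsigned char tile[1024];

    unsigned int base = get_group_id(0) * offset;
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    // Each work-item loads a strided share of the chunk into local memory.
    for (size_t i = lid; i < 1024; i += lsz)
        tile[i] = in[base + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Make some computation here; all work-items in the group share tile[]

    barrier(CLK_LOCAL_MEM_FENCE);

    // Write the modified chunk back to global memory the same way.
    for (size_t i = lid; i < 1024; i += lsz)
        in[base + i] = tile[i];
}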