OpenCL image histogram
I'm trying to write a histogram kernel in OpenCL to compute 256-bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:
const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

__kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
                               __global int* gOutput, __global int* bOutput)
{
    int2 coords = {get_global_id(0), get_global_id(1)};
    float4 sample = read_imagef(input, mSampler, coords);
    uchar rbin = floor(sample.x * 255.0f);
    uchar gbin = floor(sample.y * 255.0f);
    uchar bbin = floor(sample.z * 255.0f);
    rOutput[rbin]++;
    gOutput[gbin]++;
    bOutput[bbin]++;
}
When I run it on a 2100 x 894 image (1,877,400 pixels), I tend to see only around 1,870,000 total values recorded when I sum up the histogram values for each channel. It's also a different number each time. I did expect this, since once in a while two work-items probably read the same value from the output array and both increment it, effectively cancelling out one of the increments (I'm assuming?).
The 1,870,000 output is for a {1,1} workgroup size (which is what seems to get set by default if I don't specify otherwise). If I force a larger workgroup size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in workgroup size). This seemed strange to me, but I'm guessing what happens is that all of the work items in the group increment the output array value at the same time, and so it just counts as a single increment?
Anyway, I've read in the spec that OpenCL has no global memory synchronization, only synchronization within a work-group using its __local memory. The histogram example by NVIDIA breaks up the histogram workload into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram afterwards. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...
Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).
Thanks!
Comments (4)
As stated before, you are writing to shared memory in an unsynchronized, non-atomic way. This leads to errors. If the picture is big enough, I have a suggestion:
Make your work-groups one-dimensional, over columns or rows. Have each work-item sum up the histogram for its own column or row, and afterwards add that into the global histogram with atomic operations (atom_inc). This does most of the summing in private memory, which is much faster, and reduces the number of atomic ops.
If you work in two dimensions, you can do the same on parts of the picture.
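A minimal sketch of the per-row idea, not a tuned implementation. It assumes OpenCL 1.1 or later (where `atomic_add` on 32-bit integers in global memory is a core feature), a one-dimensional launch with one work-item per image row, and output buffers zeroed by the host beforehand. The names `histogramPerRow`, `histSampler`, and `toBin` are made up for illustration. Because each work-item merges whole counts rather than single increments, the merge uses `atomic_add` instead of an increment.

```opencl
// OpenCL C sketch: one work-item per image row (assumed 1D launch).
__constant sampler_t histSampler = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP |
                                   CLK_FILTER_NEAREST;

// Hypothetical helper: map a channel value to a bin index in [0, 255].
// convert_int_sat saturates out-of-range floats; clamp then bounds the bin.
int toBin(float v)
{
    return clamp(convert_int_sat(v * 255.0f), 0, 255);
}

__kernel void histogramPerRow(read_only image2d_t input,
                              __global int* rOutput,
                              __global int* gOutput,
                              __global int* bOutput)
{
    int row = (int)get_global_id(0);
    // The global size may have been rounded up past the image height.
    if (row >= get_image_height(input))
        return;

    // Private partial histograms (3 x 256 ints per work-item).
    int rHist[256], gHist[256], bHist[256];
    for (int i = 0; i < 256; ++i) {
        rHist[i] = 0;
        gHist[i] = 0;
        bHist[i] = 0;
    }

    // Accumulate this row without touching shared memory.
    int width = get_image_width(input);
    for (int x = 0; x < width; ++x) {
        float4 sample = read_imagef(input, histSampler, (int2)(x, row));
        rHist[toBin(sample.x)]++;
        gHist[toBin(sample.y)]++;
        bHist[toBin(sample.z)]++;
    }

    // Merge into the global histograms; skip empty bins to save atomics.
    for (int i = 0; i < 256; ++i) {
        if (rHist[i]) atomic_add(&rOutput[i], rHist[i]);
        if (gHist[i]) atomic_add(&gOutput[i], gHist[i]);
        if (bHist[i]) atomic_add(&bOutput[i], bHist[i]);
    }
}
```

One caveat: 3 KB of private arrays per work-item is more than most GPUs keep in registers, so the arrays may spill to slower memory. Whether this beats a plain atomic version is something to measure on the target device.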
[EDIT:]
I think I have a better answer: ;-)
Have a look at: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram
They have an interesting implementation there...
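For a rough picture of what a partial-histogram approach can look like, here is a sketch written for this thread. It is not NVIDIA's code and does not claim to match it. Each work-group builds its own histogram in `__local` memory and then merges it into the global one. It assumes OpenCL 1.1 or later (local and global 32-bit atomics are core there) and output buffers zeroed by the host. The names `histogramLocal`, `histSampler`, `toBin`, and `HIST_BINS` are invented for the example. The bounds check means the global size can simply be rounded up to a multiple of the work-group size, so arbitrary image sizes need no padding.

```opencl
// OpenCL C sketch: per-work-group partial histograms in __local memory.
#define HIST_BINS 256

__constant sampler_t histSampler = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP |
                                   CLK_FILTER_NEAREST;

// Hypothetical helper: map a channel value to a bin index in [0, 255].
int toBin(float v)
{
    return clamp(convert_int_sat(v * 255.0f), 0, HIST_BINS - 1);
}

__kernel void histogramLocal(read_only image2d_t input,
                             __global int* rOutput,
                             __global int* gOutput,
                             __global int* bOutput)
{
    __local int rLocal[HIST_BINS];
    __local int gLocal[HIST_BINS];
    __local int bLocal[HIST_BINS];

    // Flattened index of this work-item within its work-group.
    int lid   = (int)(get_local_id(1) * get_local_size(0) + get_local_id(0));
    int lsize = (int)(get_local_size(0) * get_local_size(1));

    // Cooperatively zero the local histograms.
    for (int i = lid; i < HIST_BINS; i += lsize) {
        rLocal[i] = 0;
        gLocal[i] = 0;
        bLocal[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // No early return here: every work-item must reach both barriers.
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    if (coords.x < get_image_width(input) &&
        coords.y < get_image_height(input)) {
        float4 sample = read_imagef(input, histSampler, coords);
        atomic_inc(&rLocal[toBin(sample.x)]);
        atomic_inc(&gLocal[toBin(sample.y)]);
        atomic_inc(&bLocal[toBin(sample.z)]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Cooperatively merge this group's counts into the global histograms.
    for (int i = lid; i < HIST_BINS; i += lsize) {
        if (rLocal[i]) atomic_add(&rOutput[i], rLocal[i]);
        if (gLocal[i]) atomic_add(&gOutput[i], gLocal[i]);
        if (bLocal[i]) atomic_add(&bOutput[i], bLocal[i]);
    }
}
```

Contention is now limited to the work-items of one group hitting fast local memory, and each group issues at most 256 global atomics per channel. Larger work-groups amortize the zeroing and merging loops better than something like {10,6}.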
Yes, you're writing to shared memory from many work-items at the same time, so you will lose updates if you don't do them in a safe way (or worse? Just don't do it). Increasing the group size increases the utilization of your compute device, which in turn increases the likelihood of conflicts. So you end up losing more updates.
However, you seem to be confusing synchronization (ordering the execution of threads) with shared memory updates (which typically require either atomic operations, or code synchronization plus memory barriers, to make sure the memory updates are visible to the other threads being synchronized).
Synchronization plus a barrier is not particularly useful for your case (and, as you noted, is not available for global synchronization anyway; the reason is that two thread groups may never run concurrently, so trying to synchronize them is nonsensical). It's typically used when all threads first work on generating a common data set, and then all consume that data set with a different access pattern.
In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168). However, note that updating a highly contended memory address (say, because you have thousands of threads all trying to write to only 256 ints) is likely to yield poor performance. All the hoops typical histogram code jumps through are there to reduce contention on the histogram data.
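As a rough illustration of the atomic approach, the question's kernel could be changed along these lines. This sketch assumes OpenCL 1.1 or later, where the function is spelled `atomic_inc` and is a core feature for 32-bit integers in global memory. On OpenCL 1.0 the `atom_inc` spelling requires the `cl_khr_global_int32_base_atomics` extension. The names `computeHistogramAtomic`, `histSampler`, and `toBin` are made up here, and the clamping of bin indices is an addition that guards against channel values outside [0, 1].

```opencl
// OpenCL C sketch: the question's kernel with atomic increments.
__constant sampler_t histSampler = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP |
                                   CLK_FILTER_NEAREST;

// Hypothetical helper: map a channel value to a bin index in [0, 255].
int toBin(float v)
{
    return clamp(convert_int_sat(v * 255.0f), 0, 255);
}

__kernel void computeHistogramAtomic(read_only image2d_t input,
                                     __global int* rOutput,
                                     __global int* gOutput,
                                     __global int* bOutput)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));

    // Ignore work-items outside the image, in case the global size
    // was rounded up to a multiple of the work-group size.
    if (coords.x >= get_image_width(input) ||
        coords.y >= get_image_height(input))
        return;

    float4 sample = read_imagef(input, histSampler, coords);

    // Each read-modify-write is now indivisible, so no update is lost.
    atomic_inc(&rOutput[toBin(sample.x)]);
    atomic_inc(&gOutput[toBin(sample.y)]);
    atomic_inc(&bOutput[toBin(sample.z)]);
}
```

With the output buffers zeroed before launch, each channel should sum to exactly the pixel count (1,877,400 for the 2100 x 894 image). The contention warning above still applies: this is the simplest correct version, not the fastest.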
You can check:
GPU Histogram - Sample code from Apple