How do I write a CUDA histogram kernel?
I am writing a CUDA kernel that computes a histogram of a picture, but I have no idea how to return an array from the kernel, and the array will change while other threads are reading it. Is there any possible solution?
__global__ void Hist(
    TColor *dst,   // input image
    int imageW,
    int imageH,
    int *data      // histogram bins
){
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < imageW && iy < imageH)
    {
        // assign the RED value of this pixel of the image to pixel
        int pixel = get_red(dst[imageW * iy + ix]);
        data[pixel]++; // ?? problem statement ...
    }
}
@param d_dst: input image; TColor is equivalent to float4.
@param data: the histogram array, size [255]
extern "C" void
cuda_Hist(TColor *d_dst, int imageW, int imageH, int *data)
{
    dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
    Hist<<<grid, threads>>>(d_dst, imageW, imageH, data);
}
Comments (3)
Have you looked at the SDK sample? The "histogram" sample is available in the CUDA SDK (currently version 3.0 on the NVIDIA developer site, version 3.1 beta available for registered developers).
The documentation with the sample explains nicely how to handle your summation, either using global memory atomics on the GPU or by collecting the results for each block separately and then doing a separate reduction (either on the host or the GPU).
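The global-memory atomics option mentioned above might look roughly like the following. This is a minimal sketch and not the SDK sample itself: it assumes the red channel has already been unpacked into an 8-bit array (instead of going through TColor and get_red), and the names histAtomicKernel, histAtomic and NUM_BINS are hypothetical. It also shows one way to get the array back: the kernel writes into device memory and the host copies the bins out with cudaMemcpy.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_BINS 256  // assumption: one bin per possible 8-bit red value

// Each thread handles one pixel and atomically increments its bin,
// so concurrent updates to the same bin are not lost.
__global__ void histAtomicKernel(const unsigned char *red, int imageW, int imageH,
                                 unsigned int *bins)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < imageW && iy < imageH) {
        atomicAdd(&bins[red[imageW * iy + ix]], 1u);
    }
}

// Hypothetical host wrapper: uploads the image, zeroes the bins, runs the
// kernel and copies the bins back. Error checking is omitted for brevity.
std::vector<unsigned int> histAtomic(const std::vector<unsigned char> &red,
                                     int imageW, int imageH)
{
    unsigned char *d_red = nullptr;
    unsigned int *d_bins = nullptr;
    cudaMalloc(&d_red, red.size());
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_red, red.data(), red.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    dim3 threads(16, 16);
    dim3 grid((imageW + threads.x - 1) / threads.x,
              (imageH + threads.y - 1) / threads.y);
    histAtomicKernel<<<grid, threads>>>(d_red, imageW, imageH, d_bins);

    // The blocking copy waits for the kernel on the default stream to finish.
    std::vector<unsigned int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_red);
    cudaFree(d_bins);
    return bins;
}
```

Every increment here goes to global memory, so images dominated by a few values will serialize on those bins; that contention is what the per-block approach tries to reduce.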
Histogramming is not particularly efficient when implemented with CUDA (or with GPGPU in general) - typically you need to generate lots of partial histograms in shared memory and then sum them. You might want to consider keeping this particular task on the CPU.
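For reference, the partial-histogram approach described here could be sketched as below. This is only an illustration under assumptions: the red channel is taken as a flat 8-bit array, shared-memory atomics are assumed to be available on the target device, and histSharedKernel, histShared and NUM_BINS are hypothetical names. Each block accumulates into its own shared-memory histogram and then merges it into the global bins.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_BINS 256  // assumption: one bin per possible 8-bit red value

__global__ void histSharedKernel(const unsigned char *red, int numPixels,
                                 unsigned int *bins)
{
    __shared__ unsigned int partial[NUM_BINS];

    // Zero this block's partial histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        partial[b] = 0;
    __syncthreads();

    // Grid-stride loop: each thread counts several pixels into shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numPixels;
         i += gridDim.x * blockDim.x)
        atomicAdd(&partial[red[i]], 1u);
    __syncthreads();

    // Merge the partial histogram into the global one, one atomic per non-empty bin.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (partial[b] != 0)
            atomicAdd(&bins[b], partial[b]);
}

// Hypothetical host wrapper; error checking is omitted for brevity.
std::vector<unsigned int> histShared(const std::vector<unsigned char> &red)
{
    const int numPixels = static_cast<int>(red.size());
    unsigned char *d_red = nullptr;
    unsigned int *d_bins = nullptr;
    cudaMalloc(&d_red, red.size());
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_red, red.data(), red.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    const int threads = 256;
    int blocks = (numPixels + threads - 1) / threads;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the grid-stride loop covers the rest
    if (blocks < 1) blocks = 1;    // a launch needs at least one block
    histSharedKernel<<<blocks, threads>>>(d_red, numPixels, d_bins);

    std::vector<unsigned int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_red);
    cudaFree(d_bins);
    return bins;
}
```

Whether this beats a CPU implementation depends on image size and on whether the image is already on the device; the transfer cost alone can outweigh the kernel time for small images.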
You will have to either use atomic functions to prevent other threads from updating the same memory at the same time, or use partial histograms. Either way, it is not that efficient unless the input image is very, very large.