Comparing Matlab vs CUDA correlation and reduction on a 2D array

I am trying to compare cross-correlation computed with an FFT against a direct windowed (sliding-window) method.

My Matlab code is:

isize = 20;
n = 7;
for i = 1:n %%7x7 xcorr
  for j = 1:n
    xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1))); %% ref is a 26x26 (676-element) array, ffcorr1 is a 20x20 (400-element) array
  end
end

The analogous CUDA kernel:

__global__ void xc_corr(double* in_im, double* ref_im, int pix3, int isize, int n, double* out1, double* temp1, double* sum_temp1)
{

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int q = 0;
    int i = 0;
    int j = 0;
    int summ = 0;

    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
        {
            summ  = 0; //force update
            for(p = 0; p < pix1; ++p)
            {
                for(q = 0; q < pix1; ++q)
                {
                    temp1[((i*n+j)*pix1*pix1)+p*pix1+q] = in_im[p*pix1+q] * ref_im[(p+i)*pix1+(q+j)];               
                    sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q] += temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                    out1[i*n+j] = sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                }
            }       
        }
    }
}

I launch this kernel as follows:

int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
cudaThreadSynchronize();

Somehow, when I diff the output files, I see that the CUDA kernel computes only the first 400 elements.

What is the correct way to write this kernel?

Also, what is the difference if I declare i and j in my kernel as shown below?

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;


傲娇萝莉攻 2024-09-20 06:44:21

int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);

means that you are launching 64 threads per block, with enough thread blocks (rounded up) to cover pix3 elements. If pix3 is indeed 400, you will launch 7 thread blocks of 64 threads each, i.e. 448 threads in total, so only 400 of them map to elements and the last 48 have nothing to do.

I'm not too sure what's the problem here.
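
For what it's worth, the standard way to deal with that overshoot is to guard the global index inside the kernel, so the extra threads in the last block return immediately. A minimal sketch of the pattern (not your kernel; scale_elements and factor are made-up names), assuming each thread handles one element of a pix3-long array:

__global__ void scale_elements(double* data, int pix3, double factor)
{
    // global 1D index of this thread across all blocks
    int p = blockIdx.x * blockDim.x + threadIdx.x;

    // the grid was rounded up to a multiple of blockDim.x, so the last
    // block may contain threads with p >= pix3; they simply do nothing
    if (p >= pix3)
        return;

    data[p] *= factor;
}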

Also,

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;

blocksize and nblocks are actually converted to dim3 vectors, so each has an (x,y,z) value. If you call a kernel with <<<7, 64>>>, that is translated to

dim3 nblocks(7,1,1);
dim3 blocksize(64,1,1);
kernel<<<nblocks, blocksize>>>();

so for each kernel launch, blockIdx and threadIdx each have three components (x, y, and z), corresponding to your position in the 3D grid of threads. In your case, since the launch only specifies an x dimension, blockIdx.y and threadIdx.y will always be 0 (and gridDim.y and blockDim.y will be 1) no matter what, so essentially they are useless in those index expressions. A 2D launch that actually uses the y components is sketched below.
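
Just as an illustration (a rough sketch, not your code: the names are made up, and it assumes in_im is isize x isize and ref_im is (isize + n - 1) x (isize + n - 1), both row-major, as in your Matlab snippet), here is what a genuinely 2D launch could look like, with one thread per (i, j) entry of the n x n output:

__global__ void xc_corr_2d(const double* in_im,   // isize x isize window
                           const double* ref_im,  // (isize+n-1) x (isize+n-1) reference
                           int isize, int n, double* out1)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // output row

    if (i >= n || j >= n)
        return;  // guard against the rounded-up grid

    int ref_w = isize + n - 1;  // row stride of the reference image
    double sum = 0.0;
    for (int p = 0; p < isize; ++p)
        for (int q = 0; q < isize; ++q)
            sum += in_im[p * isize + q] * ref_im[(p + i) * ref_w + (q + j)];

    out1[i * n + j] = sum;  // one correlation value per (i, j) lag
}

// hypothetical host-side launch, assuming n == 7 as in your question
dim3 blocksize(16, 16);
dim3 nblocks((n + blocksize.x - 1) / blocksize.x,
             (n + blocksize.y - 1) / blocksize.y);
xc_corr_2d<<<nblocks, blocksize>>>(in_im_d, ref_d, isize, n, out_d);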

Honestly, it sounds like you should reread the basics of CUDA in the programming guide, because there are a lot of fundamentals you seem to be missing. Explaining them all here wouldn't be economical since it's written down in documentation you can get here. And if you just want a faster FFT with CUDA, there are libraries (cuFFT, for one) you can download and install from Nvidia's CUDA zone that will do it for you if you don't care about learning CUDA.
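
For example, a bare-bones cuFFT call for a double-precision 2D forward transform looks roughly like this (a sketch only; d_in and d_out are placeholder device pointers to nx*ny cufftDoubleComplex values, and error checking is omitted):

#include <cufft.h>

void forward_fft_2d(cufftDoubleComplex* d_in, cufftDoubleComplex* d_out, int nx, int ny)
{
    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_Z2Z);           // double-precision complex-to-complex plan
    cufftExecZ2Z(plan, d_in, d_out, CUFFT_FORWARD);  // run the forward transform on the GPU
    cufftDestroy(plan);
}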

Best of luck mate.

PS. you don't need to call cudaThreadSynchronize after each kernel ;)
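
If you do want to wait on the host, e.g. for timing or while debugging, a common pattern is something like this (a sketch only; check_launch is a made-up helper):

#include <cstdio>
#include <cuda_runtime.h>

void check_launch(const char* tag)
{
    cudaError_t err = cudaGetLastError();   // errors from the launch configuration itself
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // errors raised while the kernel was running
    if (err != cudaSuccess)
        std::printf("%s: CUDA error: %s\n", tag, cudaGetErrorString(err));
}

Call it right after a kernel launch while debugging, then drop it once things work.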
