2D arrays on CUDA



I want to dynamically allocate a global 2D array in CUDA. How can I achieve this?

In my main I am calling my kernel in a loop. Before I call the kernel I need to allocate some memory on the GPU. After the kernel call, a single integer is sent from the GPU to the CPU to report whether the problem is solved or not.
If the problem is not solved, I do not free the old memory, since it is still needed; instead I should allocate new memory on the GPU and call the kernel again.

Pseudocode is shown below:

int n=0, i=0;
while(n==0)
{
    // allocate 2D memory for MEM[i++]
    // call kernel(MEM, i)
    // get n from the kernel
}


__global__ void kernel(int **Mem, int i)
{
    Mem[0][5]=1;
    Mem[1][0]=Mem[0][5]+23; // can use this when MEM[1] is allocated before the kernel call
}

Any suggestions? Thank you.


听,心雨的声音 2024-11-08 01:12:03


Two opening comments - using a dynamically allocated 2D array is a bad idea in CUDA, and doing repetitive memory allocations in a loop is also not a good idea. Both incur needless performance penalties.

For the host code, something like this:

size_t allocelems = 16000;                      // floats per per-iteration block
size_t allocsize = allocelems * sizeof(float);  // bytes per block
int n_allocations = 16;
float * dpointer;
cudaMalloc((void **)&dpointer, n_allocations * allocsize);

float * dcurrent = dpointer;
int n = 0;
// advance by elements, not bytes: pointer arithmetic on a float* already
// scales by sizeof(float)
for(int i=0; ((n==0) && (i<n_allocations)); i++, dcurrent+=allocelems) {

    // whatever you do before the kernel

    kernel <<< gridsize,blocksize >>> (dcurrent,.....);

    // whatever you do after the kernel

}

is preferable. Here you only call cudaMalloc once and pass offsets into the allocation, which makes memory allocation and management inside the loop effectively free. The loop structure also means you cannot run endlessly and exhaust all the GPU memory.
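The question also needs the integer n sent back to the CPU after each kernel call. A minimal sketch, assuming the kernel writes a nonzero value to a single device-side flag when the problem is solved (the name d_n is illustrative, not from the original):

int *d_n;                                 // hypothetical device flag the kernel sets when solved
cudaMalloc((void **)&d_n, sizeof(int));
cudaMemset(d_n, 0, sizeof(int));

// inside the loop, after the kernel launch:
cudaMemcpy(&n, d_n, sizeof(int), cudaMemcpyDeviceToHost); // implicitly waits for the kernel

Because cudaMemcpy on the default stream synchronizes with preceding work, reading the flag this way also serves as the end-of-iteration synchronization point.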

On the 2D array question itself, there are two reasons why it is a bad idea. Firstly, allocating a 2D array with N rows requires (N+1) cudaMalloc calls plus a host-to-device memory copy, which is slow and ugly. Secondly, inside the kernel code, to get at your data the GPU must do two global memory reads: one for the pointer indirection to get the row address, and then one to fetch the data from the row. That is much slower than this alternative:

#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
__global__ void kernel(float * Mem, int lda, ....)
{
    Mem[idx(0,5,lda)]=1; // equivalent to Mem[0][5]=1;
}

which uses indexing into a 1D allocation. In the GPU, memory transactions are very expensive, but FLOPS and IOPS are cheap, and a single integer multiply-add is the most efficient way to compute the address. If you need to access the results of a previous kernel call, just pass the offset to the previous results and use two pointers inside the kernel, something like this:

__global__ void kernel(float *Mem, int lda, int current, int previous)
{
   float * Mem0 = Mem + current;   // this call's block of the allocation
   float * Mem1 = Mem + previous;  // the previous call's block

}
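On the host side, the two offsets are just element counts into the single allocation. A hedged sketch continuing the loop above, assuming the allocelems block size from the host code, that lda is whatever row stride you chose for the flattened layout, and illustrative offset names:

int cur_off  = (int)( i      * allocelems);  // this iteration's block
int prev_off = (int)((i - 1) * allocelems);  // previous block, only meaningful once i >= 1

kernel <<< gridsize,blocksize >>> (dpointer, lda, cur_off, prev_off);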

Efficient distributed memory programs (and CUDA is really a type of distributed memory programming) start to look like Fortran after a while, but that is the price you pay for portability, transparency and efficiency.

Hope this helped.

楠木可依 2024-11-08 01:12:03


Well, you can do it just as it would be done on the CPU.

unsigned xSize = 666, ySize = 666;
// Host-side staging array for the device row pointers.
int **h_ptr = (int**)malloc(sizeof(int*) * xSize);
int **d_ptr = NULL;
cudaMalloc( (void**)&d_ptr, sizeof(int*) * xSize );
for(unsigned i = 0; i < xSize; ++i)
{
    cudaMalloc( (void**)&h_ptr[i], sizeof(int) * ySize );
}
// Copy the row pointers themselves into the device pointer array.
cudaMemcpy( d_ptr, h_ptr, sizeof(int*) * xSize, cudaMemcpyHostToDevice );
free( h_ptr );

...and free it similarly:

int **h_ptr = (int**)malloc(sizeof(int*) * xSize);
// Copy the row pointers back so the host can free each row.
cudaMemcpy( h_ptr, d_ptr, sizeof(int*) * xSize, cudaMemcpyDeviceToHost );
for(unsigned i = 0; i < xSize; ++i )
{
    cudaFree( h_ptr[i] );
}
cudaFree( d_ptr );
free( h_ptr );

But you should keep in mind that every access to a cell of this array involves accessing GPU global memory twice: once to fetch the row pointer and once for the element itself. Because of that, memory access is roughly two times slower than with a 1D array.
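For illustration, here is the question's kernel written against this layout; each element access dereferences the row pointer first and then the element, which is exactly the double read described above:

__global__ void kernel(int **Mem)
{
    // one global read to load the row pointer, then one access for the element
    Mem[0][5] = 1;
    Mem[1][0] = Mem[0][5] + 23;
}

// launched with the device pointer array from above, e.g.:
// kernel<<< gridsize, blocksize >>>( d_ptr );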

与他有关 2024-11-08 01:12:03


EDITED:
I was trying to help by providing an example in which, by flattening the array, you could achieve the same result, but others told me that is not what you are asking for.
So there is another post here that tells you how to allocate 2D arrays in CUDA.
