Copying a struct containing pointers to a CUDA device



I'm working on a project where I need my CUDA device to perform computations on a struct containing pointers.

typedef struct StructA {
    int* arr;
} StructA;

When I allocate memory for the struct and then copy it to the device, only the struct itself is copied, not the contents of the pointer. Right now I'm working around this by allocating the pointer first, then setting the host struct to use that new pointer (which resides on the GPU). The following code sample illustrates this approach using the struct above:

#define N 10

int main() {

    int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
    StructA *h_a = (StructA*)malloc(sizeof(StructA));
    StructA *d_a;
    int *d_arr;

    // 1. Allocate device struct.
    cudaMalloc((void**) &d_a, sizeof(StructA));

    // 2. Allocate device pointer.
    cudaMalloc((void**) &(d_arr), sizeof(int)*N);

    // 3. Copy pointer content from host to device.
    cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);

    // 4. Point to device pointer in host struct.
    h_a->arr = d_arr;

    // 5. Copy struct from host to device.
    cudaMemcpy(d_a, h_a, sizeof(StructA), cudaMemcpyHostToDevice);

    // 6. Call kernel.
    kernel<<<N,1>>>(d_a);

    // 7. Copy struct from device to host.
    cudaMemcpy(h_a, d_a, sizeof(StructA), cudaMemcpyDeviceToHost);

    // 8. Copy pointer from device to host.
    cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);

    // 9. Point to host pointer in host struct.
    h_a->arr = h_arr;
}
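
(For reference, the kernel launched in step 6 is not shown here; a minimal definition consistent with the kernel<<<N,1>>>(d_a) launch might look like the sketch below.)

// Sketch of the assumed kernel: N blocks of one thread each, so each
// block doubles one element through the device-side struct.
__global__ void kernel(StructA *a)
{
    a->arr[blockIdx.x] *= 2;
}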

My question is: Is this the way to do it?

It seems like an awful lot of work, and I remind you that this is a very simple struct. If my struct contained a lot of pointers, or structs with pointers themselves, the code for allocation and copying would become quite extensive and confusing.


Comments (3)

风透绣罗衣 2025-01-13 12:37:53


Edit: CUDA 6 introduces Unified Memory, which makes this "deep copy" problem a lot easier. See this post for more details.
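
As a rough sketch of how the same example looks with managed memory (assuming the same StructA, N, and kernel as in the question):

// With Unified Memory both allocations are visible to host and device,
// so the embedded pointer needs no manual translation step.
StructA *a;
cudaMallocManaged(&a, sizeof(StructA));
cudaMallocManaged(&a->arr, sizeof(int) * N);

for (int i = 0; i < N; i++) a->arr[i] = i + 1;   // initialize on the host

kernel<<<N, 1>>>(a);        // kernel can dereference a->arr directly
cudaDeviceSynchronize();    // synchronize before touching a->arr on the host

cudaFree(a->arr);
cudaFree(a);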


Don't forget that you can pass structures by value to kernels. This code works:

// pass struct by value (may not be efficient for complex structures)
__global__ void kernel2(StructA in)
{
    in.arr[threadIdx.x] *= 2;
}

Doing so means you only have to copy the array to the device, not the structure:

int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
StructA h_a;
int *d_arr;

// 1. Allocate device array.
cudaMalloc((void**) &(d_arr), sizeof(int)*N);

// 2. Copy array contents from host to device.
cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);

// 3. Point to device pointer in host struct.
h_a.arr = d_arr;

// 4. Call kernel with host struct as argument
kernel2<<<N,1>>>(h_a);

// 5. Copy pointer from device to host.
cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);

// 6. Point to host pointer in host struct 
//    (or do something else with it if this is not needed)
h_a.arr = h_arr;

不可一世的女人 2025-01-13 12:37:53


As pointed out by Mark Harris, structures can be passed by value to CUDA kernels. However, some care should be taken to set up a proper destructor, since the destructor is called on exit from the kernel.

Consider the following example

#include <stdio.h>

#include "Utilities.cuh"

#define NUMBLOCKS  512
#define NUMTHREADS 512 * 2

/***************/
/* TEST STRUCT */
/***************/
struct Lock {

    int *d_state;

    // --- Constructor
    Lock(void) {
        int h_state = 0;                                        // --- Host side lock state initializer
        gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int)));  // --- Allocate device side lock state
        gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
    }

    // --- Destructor (wrong version)
    //~Lock(void) { 
    //  printf("Calling destructor\n");
    //  gpuErrchk(cudaFree(d_state)); 
    //}

    // --- Destructor (correct version)
//  __host__ __device__ ~Lock(void) {
//#if !defined(__CUDACC__)
//      gpuErrchk(cudaFree(d_state));
//#else
//
//#endif
//  }

    // --- Lock function
    __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }

    // --- Unlock function
    __device__ void unlock(void) { atomicExch(d_state, 0); }
};

/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCounterLocked(Lock lock, int *nblocks) {

    if (threadIdx.x == 0) {
        lock.lock();
        *nblocks = *nblocks + 1;
        lock.unlock();
    }
}

/********/
/* MAIN */
/********/
int main(){

    int h_counting, *d_counting;
    Lock lock;

    gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));

    // --- Locked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCounterLocked<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the locked case: %i\n", h_counting);

    gpuErrchk(cudaFree(d_counting));
}

with the wrong version of the destructor uncommented (do not pay too much attention to what the code actually does). If you run that code, you will get the following output:

Calling destructor
Counting in the locked case: 512
Calling destructor
GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37

There are two calls to the destructor: one at kernel exit and one at exit from main. The error message is due to the fact that, if the memory location pointed to by d_state is freed at kernel exit, it cannot be freed again at exit from main. Accordingly, the destructor must behave differently for host and device execution. This is accomplished by the second, commented-out destructor in the above code; a cleaned-up version is shown below.
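
For readability, here is that destructor again, uncommented, with comments added (the body is otherwise unchanged from the listing above):

__host__ __device__ ~Lock(void) {
    // Only host compilers see the cudaFree; nvcc-compiled code does not,
    // so the by-value copy destroyed at kernel exit leaves d_state alive.
#if !defined(__CUDACC__)
    gpuErrchk(cudaFree(d_state));
#else
    // no-op in nvcc-compiled (device-capable) code
#endif
}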

一个人的夜不怕黑 2025-01-13 12:37:53


A struct of arrays is a nightmare in CUDA: you have to copy each of the pointers into a new struct that the device can use. Maybe you could use an array of structs instead (see the sketch below)? If not, the only way I have found is to attack it the way you do, which is in no way pretty.
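
A sketch of that alternative, using a hypothetical pointer-free StructB: an array of such structs is one contiguous block, so a single cudaMalloc/cudaMemcpy pair moves everything at once.

// Hypothetical pointer-free struct: an array of these is contiguous,
// so it copies to the device in one cudaMemcpy with no deep copy.
typedef struct StructB {
    int val;
} StructB;

StructB h_b[N];
StructB *d_b;
cudaMalloc((void**)&d_b, sizeof(StructB) * N);
cudaMemcpy(d_b, h_b, sizeof(StructB) * N, cudaMemcpyHostToDevice);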

EDIT:
Since I can't comment on the top post: step 9 is redundant, since you can combine steps 8 and 9 into

// 8. Copy pointer from device to host.
cudaMemcpy(h_a->arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);