Pointers in structs passed to CUDA

Posted 2024-09-10 18:18:47

I've been messing around with this for a while now, but can't seem to get it right. I'm trying to copy objects that contain arrays into CUDA device memory (and back again, but I'll cross that bridge when I come to it):

struct MyData {
  float *data;
  int dataLen;
};

void copyToGPU() {
  // Create dummy objects to copy
  int N = 10;
  MyData *h_items = new MyData[N];
  for (int i=0; i<N; i++) {
    h_items[i].dataLen = 100;
    h_items[i].data = new float[100];
  }

  // Copy objects to GPU
  MyData *d_items;
  int memSize = N * sizeof(MyData);
  cudaMalloc((void**)&d_items, memSize);
  cudaMemcpy(d_items, h_items, memSize, cudaMemcpyHostToDevice);

  // Run the kernel
  MyFunc<<<100,100>>>(d_items);
}

__global__
static void MyFunc(MyData *data) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i=0; i<data[idx].dataLen; i++) {
    // Do something with data[idx].data[i]
  }
}

When I call MyFunc(d_items), I can access data[idx].dataLen just fine. However, data[idx].data has not been copied yet.

I can't use d_items[i].data in copyToGPU as a destination for cudaMalloc/cudaMemcpy operations, since the host code cannot dereference a device pointer.

What to do?

同展鸳鸯锦 2024-09-17 18:18:48

1. Allocate the device data for all structures as a single array.
2. Copy the contiguous data from the host to the GPU.
3. Adjust the GPU pointers.

Example:

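float *d_data;
// One device allocation large enough for all N arrays of 100 floats
cudaMalloc((void**)&d_data, N*100*sizeof(float));
for (...) {
    // Point each struct at its own 100-float slice of the device buffer
    h_items[i].data = i*100 + d_data;
}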
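
As a fuller sketch of these three steps, something like the following should work. The names check, sumKernel, runFlattenedCopy and h_staging are made up for illustration and are not from the question. The host arrays in the question are allocated separately, so this version copies each one into its slice of the device buffer, and it patches a staging copy of the structs instead of overwriting h_items[i].data, so the host pointers stay usable.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

struct MyData {
  float *data;
  int dataLen;
};

// Hypothetical helper (not from the post): abort on any CUDA runtime error.
static void check(cudaError_t err) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::exit(EXIT_FAILURE);
  }
}

// One thread per struct; each thread sums its array through the patched pointer.
__global__ void sumKernel(const MyData *items, int n, float *sums) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;  // guard: the grid may hold more threads than structs
  float s = 0.0f;
  for (int i = 0; i < items[idx].dataLen; i++) s += items[idx].data[i];
  sums[idx] = s;
}

// Hypothetical driver: builds N structs of len floats (array i is filled with
// the value i), moves them to the device, and returns the per-struct sums.
std::vector<float> runFlattenedCopy(int N, int len) {
  std::vector<MyData> h_items(N);
  for (int i = 0; i < N; i++) {
    h_items[i].dataLen = len;
    h_items[i].data = new float[len];
    for (int j = 0; j < len; j++) h_items[i].data[j] = static_cast<float>(i);
  }

  // Step 1: one device buffer for all arrays.
  float *d_data = nullptr;
  check(cudaMalloc((void**)&d_data, sizeof(float) * N * len));

  // Steps 2 and 3: copy each array into its slice and record the slice's
  // device address in a staging copy of the structs.
  std::vector<MyData> h_staging(h_items);
  for (int i = 0; i < N; i++) {
    float *slice = d_data + static_cast<size_t>(i) * len;
    check(cudaMemcpy(slice, h_items[i].data, sizeof(float) * len,
                     cudaMemcpyHostToDevice));
    h_staging[i].data = slice;
  }

  // Copy the patched structs; their data members now hold device addresses.
  MyData *d_items = nullptr;
  check(cudaMalloc((void**)&d_items, sizeof(MyData) * N));
  check(cudaMemcpy(d_items, h_staging.data(), sizeof(MyData) * N,
                   cudaMemcpyHostToDevice));

  float *d_sums = nullptr;
  check(cudaMalloc((void**)&d_sums, sizeof(float) * N));
  int threads = 128;
  int blocks = (N + threads - 1) / threads;
  sumKernel<<<blocks, threads>>>(d_items, N, d_sums);
  check(cudaGetLastError());

  // This blocking copy on the default stream waits for the kernel to finish.
  std::vector<float> sums(N);
  check(cudaMemcpy(sums.data(), d_sums, sizeof(float) * N,
                   cudaMemcpyDeviceToHost));

  check(cudaFree(d_sums));
  check(cudaFree(d_items));
  check(cudaFree(d_data));
  for (int i = 0; i < N; i++) delete[] h_items[i].data;
  return sums;
}
```

Calling runFlattenedCopy(10, 100) from main mirrors the sizes in the question. If the host data were already stored in one contiguous array, the per-array copies in the loop could collapse into a single cudaMemcpy.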

离旧人 2024-09-17 18:18:48

The code you provide copies only the MyData structures: a host address and an integer. To be overly clear, you are copying the pointer and not the data; you have to copy the data explicitly.

If the data is always the same LENGTH, then you probably just want to make one big array:

float *d_data;
memSize = N * LENGTH * sizeof(float);
cudaMalloc((void**) &d_data, memSize);

//and a single copy
cudaMemcpy(d_data, h_data, memSize, cudaMemcpyHostToDevice);

If it needs to be in a struct with other data, then:

struct MyData {
  float data[LENGTH];
  int other_data;
};

MyData *d_items;
memSize = N * sizeof(MyData);
cudaMalloc((void**) &d_items, memSize);

//and again a single copy, this time of the whole struct array
cudaMemcpy(d_items, h_items, memSize, cudaMemcpyHostToDevice);

But I am assuming your data comes in a variety of lengths. One solution is to set LENGTH to the maximum length (and just waste some space), and then do it the same way as above. That might be the easiest way to start, and you can optimize later.
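
A minimal sketch of the padded layout, assuming every array fits in a known bound. MAX_LEN, PaddedData, paddedSum and sumPadded are illustrative names, and error checking is left out to keep it short; real code should check each CUDA return value.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int MAX_LEN = 8;  // assumed upper bound on any array's length

struct PaddedData {
  float data[MAX_LEN];  // storage lives inside the struct: no pointer to patch
  int dataLen;          // number of valid entries, must be <= MAX_LEN
};

// One thread per struct; only the first dataLen entries are read.
__global__ void paddedSum(const PaddedData *items, int n, float *sums) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  float s = 0.0f;
  for (int i = 0; i < items[idx].dataLen; i++) s += items[idx].data[i];
  sums[idx] = s;
}

// Hypothetical driver; assumes h_items is non-empty.
std::vector<float> sumPadded(const std::vector<PaddedData> &h_items) {
  int n = static_cast<int>(h_items.size());
  PaddedData *d_items = nullptr;
  float *d_sums = nullptr;
  cudaMalloc((void**)&d_items, n * sizeof(PaddedData));
  cudaMalloc((void**)&d_sums, n * sizeof(float));

  // A single copy moves every struct together with its embedded array.
  cudaMemcpy(d_items, h_items.data(), n * sizeof(PaddedData),
             cudaMemcpyHostToDevice);

  int threads = 128;
  paddedSum<<<(n + threads - 1) / threads, threads>>>(d_items, n, d_sums);

  // The blocking copy back also waits for the kernel to finish.
  std::vector<float> sums(n);
  cudaMemcpy(sums.data(), d_sums, n * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_sums);
  cudaFree(d_items);
  return sums;
}
```

The cost is that every struct transfers MAX_LEN floats no matter how many are in use, which is the wasted space mentioned above.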

If you can't afford the wasted memory and transfer time, then I would use three arrays on both the host and the device: one with all the data, one with offsets, and one with lengths:

//host memory
float *h_data;
int h_offsets[N], h_lengths[N]; //or allocate these dynamically if necessary
int totalLength;

//device memory
float *d_data;
int *d_offsets, *d_lengths;

/* calculate totalLength, allocate h_data, and fill the three arrays */

//allocate device memory
cudaMalloc((void**) &d_data, totalLength * sizeof(float));
cudaMalloc((void**) &d_offsets, N * sizeof(int));
cudaMalloc((void**) &d_lengths, N * sizeof(int));

//and now three copies
cudaMemcpy(d_data, h_data, totalLength * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_offsets, h_offsets, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_lengths, h_lengths, N * sizeof(int), cudaMemcpyHostToDevice);

Now in thread i you can find the data that starts at d_data[d_offsets[i]] and has a length of d_lengths[i].
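
The three-array layout could be put together roughly like this. It is only a sketch: raggedSum and sumRagged are illustrative names, the host side uses std::vector instead of raw arrays, at least one non-empty row is assumed, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Thread i finds its row through the offset and length tables.
__global__ void raggedSum(const float *d_data, const int *d_offsets,
                          const int *d_lengths, int n, float *d_sums) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  const float *row = d_data + d_offsets[i];  // same as &d_data[d_offsets[i]]
  float s = 0.0f;
  for (int k = 0; k < d_lengths[i]; k++) s += row[k];
  d_sums[i] = s;
}

// Hypothetical driver: packs variable-length rows and sums each on the device.
std::vector<float> sumRagged(const std::vector<std::vector<float>> &rows) {
  int N = static_cast<int>(rows.size());

  // Fill the three host arrays: packed data, start offsets, lengths.
  std::vector<float> h_data;
  std::vector<int> h_offsets(N), h_lengths(N);
  for (int i = 0; i < N; i++) {
    h_offsets[i] = static_cast<int>(h_data.size());
    h_lengths[i] = static_cast<int>(rows[i].size());
    h_data.insert(h_data.end(), rows[i].begin(), rows[i].end());
  }
  int totalLength = static_cast<int>(h_data.size());

  float *d_data = nullptr, *d_sums = nullptr;
  int *d_offsets = nullptr, *d_lengths = nullptr;
  cudaMalloc((void**)&d_data, totalLength * sizeof(float));
  cudaMalloc((void**)&d_offsets, N * sizeof(int));
  cudaMalloc((void**)&d_lengths, N * sizeof(int));
  cudaMalloc((void**)&d_sums, N * sizeof(float));

  // Three host-to-device copies, one per array.
  cudaMemcpy(d_data, h_data.data(), totalLength * sizeof(float),
             cudaMemcpyHostToDevice);
  cudaMemcpy(d_offsets, h_offsets.data(), N * sizeof(int),
             cudaMemcpyHostToDevice);
  cudaMemcpy(d_lengths, h_lengths.data(), N * sizeof(int),
             cudaMemcpyHostToDevice);

  int threads = 128;
  raggedSum<<<(N + threads - 1) / threads, threads>>>(d_data, d_offsets,
                                                      d_lengths, N, d_sums);

  // The blocking copy back also waits for the kernel to finish.
  std::vector<float> sums(N);
  cudaMemcpy(sums.data(), d_sums, N * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_sums);
  cudaFree(d_lengths);
  cudaFree(d_offsets);
  cudaFree(d_data);
  return sums;
}
```

Because the kernel only ever sees indices into one buffer, no device pointer has to be stored inside host-side structs at all.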
