CUDA device pointer arithmetic
I've used:
float *devptr;
//...
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a CUDA kernel, e.g.:
__global__ void kernelname(float *ptr)
{
//...
}
on that array, but with an offset.
In C/C++ it would be something like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without passing the offset to the kernel as a separate argument and using that offset in the kernel code?
Any ideas on how to do this?
Comments (2)
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remember the offset is not a byte offset; it is a plain word or element offset).
EDIT: A simple working example:
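A minimal sketch of such an example, assuming a small float array (the array size and values here are illustrative, not from the original post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 8;
    float host[n], back[n - 1];
    for (int i = 0; i < n; i++)
        host[i] = (float)i;

    float *devptr;
    cudaMalloc(&devptr, n * sizeof(float));
    cudaMemcpy(devptr, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Element offset, not a byte offset: devptr + 1 points at the
    // second float, so this copy back starts from the second word.
    cudaMemcpy(back, devptr + 1, (n - 1) * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n - 1; i++)
        printf("%.1f ", back[i]);
    printf("\n");

    cudaFree(devptr);
    return 0;
}
```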
Here, you can see a word/element offset has been applied to the device pointer in the second cudaMemcpy call to start the copy from the second word, not the first.
Pointer arithmetic does work in host-side code; it's used fairly often in the example code provided by NVIDIA.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz4KialMz00
And from the NVIDIA Performance Primitives (NPP) documentation, a perfect example of pointer arithmetic:
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy primitive one could copy the second channel of this source image into the first channel of a destination image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the data type of the pointer, and calculates the address accordingly.
In C and C++, pointer arithmetic can be written as above or with the notation &ptr[offset] (which yields the device-memory address of that element, not its value; dereferencing device memory from host-side code will not work). With either notation the size of the data type is handled automatically, and the offset is specified as a number of data elements rather than bytes.