CUDA pointer dereferencing problem

Posted 2024-09-14 23:16:26


I am developing a program using the CUDA SDK and a 9600 1 GB NVIDIA card. In this program:

0) The kernel is passed a pointer to a 2D int array of size 3000x6 in its input arguments.

1) The kernel has to sort it on up to 3 levels (1st, 2nd & 3rd column).

2) For this purpose, the kernel declares an array of int pointers of size 3000.

3) The kernel then populates the pointer array with pointers to the locations of the input array in sorted order.

4) Finally, the kernel copies the input array into an output array by dereferencing the pointer array.

This last step fails and halts the PC.

Q1) What are the guidelines for pointer dereferencing in CUDA to fetch the contents of memory? Even the smallest array, 20x2, does not work correctly; the same code works outside CUDA device memory (i.e., in a standard C program).

Q2) Isn't it supposed to work the same as in standard C using the '*' operator, or is there some CUDA API to be used for it?
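For reference, a minimal sketch of the pattern the question describes, with hypothetical names (sortRows, d_in, d_out are not from the original post). It assumes both arrays were allocated with cudaMalloc(), so every pointer stored in the local array points into device memory and dereferencing it inside the kernel is legal:

    #define ROWS 3000
    #define COLS 6

    // Single-threaded sketch; launch as sortRows<<<1, 1>>>(d_in, d_out)
    // with d_in and d_out allocated via cudaMalloc().
    __global__ void sortRows(int *d_in, int *d_out)
    {
        // Per-thread array of pointers into the *device* input array.
        // Note this lives in local memory (ROWS * sizeof(int *) bytes).
        int *rowPtr[ROWS];

        for (int i = 0; i < ROWS; ++i)
            rowPtr[i] = &d_in[i * COLS];          // pointer to row i

        /* ... reorder rowPtr[] here according to the 3-level sort ... */

        // Copy input to output by dereferencing the pointer array.
        for (int i = 0; i < ROWS; ++i)
            for (int j = 0; j < COLS; ++j)
                d_out[i * COLS + j] = rowPtr[i][j];
    }

If the pointers stored in rowPtr[] were host addresses (e.g. the address of a malloc'ed array passed straight to the kernel), step 4 would read garbage or crash, which matches the symptom described.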


Comments (2)

阪姬 2024-09-21 23:16:26


I just started looking into CUDA, but I literally just read this out of a book. It sounds like it directly applies to you.

"You can pass pointers allocated with cudaMalloc() to functions that execute on the device. (kernels, right?)

You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device. (kernels again)

You can pass pointers allocated with cudaMalloc() to functions that execute on the host. (regular C code)

You CANNOT use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host."

  • ^^ from "CUDA by Example" by Jason Sanders and Edward Kandrot, published by Addison-Wesley, yadda yadda, no plagiarism here.

Since you are dereferencing inside the kernel, maybe the opposite of the last rule is also true, i.e. you cannot use pointers allocated by the host to read or write memory from code that executes on the device.

Edit: I also just noticed a function called cudaMemcpy.

It looks like you would need to declare the 3000-int array twice in host code: once by calling malloc, once by calling cudaMalloc. Pass the CUDA one to the kernel, as well as the input array to be sorted. Then, after calling the kernel function:

cudaMemcpy(malloced_array, cudaMallocedArray, 3000*sizeof(int), cudaMemcpyDeviceToHost);

I literally just started looking into this, like I said, so maybe there's a better solution.
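A rough sketch of the host-side flow this answer describes; the names (h_data, d_data, sortKernel) and the single-thread launch are assumptions for illustration, not from the original post:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 3000

    __global__ void sortKernel(int *d_data) { /* ... sort in place ... */ }

    int main(void)
    {
        int *h_data = (int *)malloc(N * sizeof(int));   // host copy (malloc)
        int *d_data = NULL;
        cudaMalloc((void **)&d_data, N * sizeof(int));  // device copy (cudaMalloc)

        /* ... fill h_data ... */

        // Host -> device, run the kernel on the device copy, device -> host.
        cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
        sortKernel<<<1, 1>>>(d_data);
        cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }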

女中豪杰 2024-09-21 23:16:26


CUDA code can use pointers in exactly the same manner as host code (e.g. dereference with * or [], normal pointer arithmetic and so on). However, it is important to remember that the location being accessed (i.e. the location to which the pointer points) must be visible to the GPU.

If you allocate host memory, using malloc() or std::vector for example, then that memory is not visible to the GPU: it is host memory, not device memory. To allocate device memory you should use cudaMalloc() - pointers to memory allocated with cudaMalloc() can be freely accessed from the device, but not from the host.

To copy data between the two, use cudaMemcpy().

When you get more advanced the lines can be blurred a little: using "mapped memory" it is possible to allow the GPU to access parts of host memory, but this must be handled in a particular way; see the CUDA Programming Guide for more information.
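As an illustration only (not part of the original answer), a minimal sketch of that mapped-memory route using the CUDA runtime API; it assumes a device that supports host-memory mapping, which should be checked via cudaDeviceProp::canMapHostMemory:

    #include <cuda_runtime.h>

    // Returns a page-locked host buffer of n ints that the GPU can address
    // directly; *d_alias receives the device-side pointer to the same memory.
    int *alloc_mapped(int **d_alias, size_t n)
    {
        int *h_buf = NULL;

        cudaSetDeviceFlags(cudaDeviceMapHost);  // must run before the CUDA context is created

        // Page-locked, mapped host allocation.
        cudaHostAlloc((void **)&h_buf, n * sizeof(int), cudaHostAllocMapped);

        // Device-side alias for the same memory, usable inside kernels.
        cudaHostGetDevicePointer((void **)d_alias, h_buf, 0);
        return h_buf;
    }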

I'd strongly suggest you look at the CUDA SDK samples to see how all this works. Start with the vectorAdd sample perhaps, and any that are specific to your domain of expertise. Matrix multiplication and transpose are probably easy to digest too.

All the documentation, the toolkit and the code samples (SDK) are available on the CUDA developer web site.
