CUDA texture memory: binding a sub-portion of global memory

Posted on 2024-10-29 07:06:51

I am having a problem binding a sub-portion of global device memory to texture memory.

I have a large global device array filled with data as follows:

double * device_global;

cudaMalloc((void **)&device_global, sizeof(double)*N);

cudaMemcpy(device_global, host, sizeof(double)*N, cudaMemcpyHostToDevice);

I am running numerous kernels in a for loop.

Each kernel requires a small portion (int offset = 100) of device_global, which I bind to a texture with:

cudaBindTexture(0, texRef, device_global, channelDesc, sizeof(double)*10);

However, I am unable to use pointer arithmetic to bind only the section of device_global needed in each iteration, using an offset that advances with the loop.

I would like to do something like:

cudaBindTexture(0, texRef, device_global + offset * i, channelDesc, sizeof(double)*10);

Note that the above approach does work if the offset is set to 0; with a nonzero offset, the pointer arithmetic somehow does not work.

Any help or other guidelines would be much appreciated.

Comments (3)

美人如玉 2024-11-05 07:06:51

Passing 0 or NULL as the first argument of cudaBindTexture is bad practice. CUDA texture binding requires the bound pointer to be aligned. The required alignment is given by the cudaDeviceProp::textureAlignment device property.

cudaBindTexture can bind any device pointer to a texture. If the pointer is not aligned, the function writes, through its first argument, the byte offset of the pointer from the nearest preceding aligned address. If the pointer is not aligned and the first argument is NULL, the call fails.

Binding should be done as:

size_t texture_offset = 0;
cudaBindTexture(&texture_offset, texRef, device_global + offset * i, channelDesc, sizeof(double)*10);
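
The arithmetic behind the returned offset can be sketched on the host without any CUDA calls. The following is a minimal illustration, not the runtime's actual implementation: `TextureOffset` and `split_address` are hypothetical names, and the alignment of 512 bytes used in the usage note is only an assumed value standing in for cudaDeviceProp::textureAlignment. It splits an address into the nearest preceding aligned address plus a remainder, and converts that remainder to elements, since the byte offset has to be divided by the element size before it is added to the index used in the texture fetch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: an address split into an aligned base plus a remainder.
struct TextureOffset {
    std::uintptr_t aligned_base;   // nearest preceding aligned address
    std::size_t    byte_offset;    // distance from aligned_base, in bytes
    std::size_t    element_offset; // the same distance, in elements
};

// `alignment` stands in for cudaDeviceProp::textureAlignment (query the real
// value on your device); `element_size` would be sizeof(double) here.
inline TextureOffset split_address(std::uintptr_t address,
                                   std::size_t alignment,
                                   std::size_t element_size) {
    assert(alignment != 0 && element_size != 0);
    TextureOffset r;
    r.byte_offset    = address % alignment;         // remainder past the aligned base
    r.aligned_base   = address - r.byte_offset;     // round down to the alignment
    r.element_offset = r.byte_offset / element_size; // add this to the fetch index
    return r;
}
```

For example, with an assumed aligned base address of 4096, an assumed alignment of 512 bytes, and an offset of 100 doubles (800 bytes), `split_address(4096 + 800, 512, sizeof(double))` gives an aligned base of 4608, a byte offset of 288, and an element offset of 36. A kernel would then read element k of the intended window at fetch index 36 + k.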

倾其所爱 2024-11-05 07:06:51

The offset into texture memory must be aligned. You cannot bind an arbitrary portion of memory, only one that is properly aligned, because of how the internal high-performance hardware works.

One solution would be to use pitched memory and, instead of many very small textures, have several big ones, each starting at an aligned row of the matrix.

I am guessing here, but I think that with a texture data size of only

sizeof(double)*10

setting up the texture takes longer than reading from it.

How big is the total matrix?

夢归不見 2024-11-05 07:06:51

I don't believe it is possible to do what you are trying to do. I suspect there is some behind-the-scenes address translation, so that if the pointer you pass to the binding call isn't already known to the runtime memory manager and suitably aligned to a page boundary, the runtime won't permit a texture to be bound to that address.

It might be better to bind the whole array to the texture and then pass an indexing offset into each kernel to be used in the texture fetch.
