Passing data from CPU to GPU without explicitly passing it as a parameter



Is it possible to pass the data from CPU to GPU without explicitly passing it as a parameter?

I don't want to pass it as a parameter, primarily for syntactic-sugar reasons: I have about 20 constant parameters to pass, and also because I successively invoke two kernels with (almost) the same parameters.

I want something along the lines of

__constant__ int* blah;

__global__ void myKernel(...) {
    // ... I want to use blah inside ...
}

int main() {
    ...
    cudaMalloc(/* ...allocate blah... */);
    cudaMemcpy(/* ...copy my array from CPU to blah... */);
}


Comments (4)

又爬满兰若 2024-12-16 17:17:14


cudaMemcpyToSymbol seems to be the function you're looking for. It works similarly to cudaMemcpy, but with an additional 'offset' argument which looks like it'll make it easier to copy across 2D arrays.

(I'm hesitant to provide code, since I'm unable to test it - but see this thread and this post for reference.)
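A minimal, untested sketch of what that usage might look like (the symbol name blah and the size 20 are placeholders standing in for the poster's ~20 constants):

__constant__ int blah[20];   // visible to every kernel in this translation unit

__global__ void myKernel() {
    int sum = blah[0] + blah[19];   // blah is read directly, no parameter needed
    (void)sum;                      // placeholder: real code would use the values
}

int main() {
    int h_blah[20] = {0};   // fill in the host-side values as needed
    // copy 20 ints into the __constant__ symbol; the 0 is the byte offset
    cudaMemcpyToSymbol(blah, h_blah, 20 * sizeof(int), 0, cudaMemcpyHostToDevice);
    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
}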

错々过的事 2024-12-16 17:17:14


Use __device__ to declare global variables. It's similar to the way __constant__ is used.
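For instance, a small sketch of that approach (the names and sizes are illustrative): a __device__ global is filled through its symbol just like a __constant__ one, but kernels can also write to it:

__device__ int blah[1024];   // global-memory array, readable and writable by kernels

__global__ void myKernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    blah[idx] *= 2;          // kernels may write __device__ data, unlike __constant__
}

int main() {
    int h_blah[1024] = {0};  // initialize as you want
    cudaMemcpyToSymbol(blah, h_blah, sizeof(h_blah));     // host -> device symbol
    myKernel<<<4, 256>>>();
    cudaMemcpyFromSymbol(h_blah, blah, sizeof(h_blah));   // device symbol -> host
}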

负佳期 2024-12-16 17:17:14


There are a few approaches you can take. It depends on how you are going to use that data.

  1. If your access pattern is constant and threads within a block read the same location, use __constant__ memory to broadcast the read requests.
  2. If your access pattern reads the neighbors of a given position, or is random (not coalesced), then I'd recommend using texture memory.
  3. If you need to read and write the data and know the size of the array, define it as __device__ int blah[size] at file scope.

For example:

__constant__ int c_blah[16384];   // constant memory (limited to 64 KB, i.e. 16384 ints)
__device__ int g_blah[1048576];   // global memory
// a texture reference must be declared at file scope, not inside main()
texture<int, 1, cudaReadModeElementType> tref;

__global__ void myKernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // get data from constant memory (wrap the index: c_blah is smaller than the grid)
    int c = c_blah[idx % 16384];
    // get data from global memory
    int g = g_blah[idx];
    // get data from texture memory
    int t = tex1Dfetch(tref, idx);
    // operate and write the result back to global memory
    g_blah[idx] = c + g + t;
}

int main() {
    // declare arrays on the host ('static' keeps the big arrays off the stack)
    static int c_h_blah[16384];   // and initialize it as you want
    static int g_h_blah[1048576]; // and initialize it as you want
    static int t_h_blah[1048576]; // and initialize it as you want
    // copy from host to constant memory
    cudaMemcpyToSymbol(c_blah, c_h_blah, 16384*sizeof(int), 0, cudaMemcpyHostToDevice);
    // copy from host to the __device__ array, also addressed through its symbol
    cudaMemcpyToSymbol(g_blah, g_h_blah, 1048576*sizeof(int), 0, cudaMemcpyHostToDevice);
    // a texture must be bound to *device* memory, so stage the data on the GPU first
    int* t_d_blah;
    cudaMalloc(&t_d_blah, 1048576*sizeof(int));
    cudaMemcpy(t_d_blah, t_h_blah, 1048576*sizeof(int), cudaMemcpyHostToDevice);
    cudaBindTexture(0, tref, t_d_blah, 1048576*sizeof(int));
    // call your kernel: one thread per element of the large arrays
    dim3 dimBlock(256);
    dim3 dimGrid(1048576 / 256);
    myKernel<<<dimGrid, dimBlock>>>();
    // copy the result from the __device__ symbol back to CPU memory
    cudaMemcpyFromSymbol(g_h_blah, g_blah, 1048576*sizeof(int), 0, cudaMemcpyDeviceToHost);
    cudaUnbindTexture(tref);
    cudaFree(t_d_blah);
}

This way you can use three arrays in the kernel without passing any parameters to it. Note that this is only an example of use, not an optimized use of the memory hierarchy; i.e., using constant memory in this way is not recommended.

Hope this helps.

埖埖迣鎅 2024-12-16 17:17:14


Be careful when using cudaMemcpyToSymbol: it can introduce bugs if you are trying to copy a struct from the CPU to the GPU.
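The answer doesn't spell out which bugs it means, but one classic pitfall fits the description (the Params struct below is made up for illustration): if the struct contains a pointer, cudaMemcpyToSymbol copies the pointer value bit-for-bit, so the device-side copy still holds a host address:

struct Params {
    int n;
    int* data;                 // pointer member: this is where the trap lies
};

__constant__ Params d_params;

int main() {
    static int h_data[16];
    Params p;
    p.n = 16;
    p.data = h_data;           // a *host* pointer
    // this copies the struct bit-for-bit, including the host address in p.data;
    // a kernel dereferencing d_params.data would then read invalid memory
    cudaMemcpyToSymbol(d_params, &p, sizeof(Params));
    // fix: cudaMalloc a device buffer, cudaMemcpy h_data into it, and store that
    // device pointer in p.data before calling cudaMemcpyToSymbol
}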
