Is there some incompatibility between Boost::thread() and Nvidia CUDA?
I'm developing a generic streaming CUDA kernel execution framework that allows parallel data copy & execution on the GPU.
Currently I'm calling the CUDA kernels through a C++ function wrapper, so I can call the kernels from a .cpp file (not a .cu file), like this:
//kernels.cu:
//kernel definition
__global__ void kernelCall_kernel(dataRow* in, dataRow* out, void* additionalData){
    //Do something
}

//kernel handler, so I can compile this .cu, link it with the main project and call it from a .cpp file
extern "C" void kernelCall(dataRow* in, dataRow* out, void* additionalData){
    int blocksize = 256;
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(tableSize/(float)blocksize));
    kernelCall_kernel<<<dimGrid,dimBlock>>>(in, out, additionalData);
}
If I call the handler as a normal function, the printed data is correct.
//streamProcessing.cpp
//allocations and definitions of data omitted

//copy data to GPU
cudaMemcpy(data_d, data_h, tableSize, cudaMemcpyHostToDevice);
//call:
kernelCall(data_d, result_d, NULL);
//copy data back
cudaMemcpy(result_h, result_d, resultSize, cudaMemcpyDeviceToHost);
//show result:
printTable(result_h, resultSize); // this just iterates over the data and prints it
But to allow parallel copy and execution of data on the GPU, I need to create a thread, so I call the handler from a new boost::thread:
//allocations, definitions of data and copy of data to GPU omitted
//call:
boost::thread* kernelThreadOwner = new boost::thread(kernelCall, data_d, result_d, NULL);
kernelThreadOwner->join();
//copy data back and print omitted
I just get garbage when printing the result at the end.
Currently I'm using just one thread for testing purposes, so there should not be much difference between calling the function directly and calling it from a thread. I have no clue why the direct call gives the right result while the threaded call does not. Is this a problem with CUDA and Boost? Am I missing something? Thanks for your advice.
1 Answer
The problem is that (pre CUDA 4.0) CUDA contexts are tied to the thread in which they were created. When you are using two threads, you have two contexts. The context that the main thread allocates in and reads from is not the same as the context of the thread that runs the kernel. Memory allocations are not portable between contexts; they are effectively separate memory spaces inside the same GPU.
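In practice this means that every CUDA call touching a given allocation has to come from the thread whose context owns it. Below is a minimal sketch of the first workaround described in the next paragraph, where a single worker thread issues all CUDA calls and hands the parent only plain host memory. The helper names gpuWorker and runOnWorkerThread are hypothetical; dataRow, kernelCall, tableSize and resultSize are taken from the question.

// Hypothetical sketch (not from the question): the worker thread issues every
// CUDA call itself, so only one context -- the worker's -- is ever involved
// and no device pointer crosses a thread/context boundary.
#include <boost/thread.hpp>
#include <cuda_runtime.h>

struct dataRow;   // the question's row type, defined elsewhere in the project
extern "C" void kernelCall(dataRow* in, dataRow* out, void* additionalData);

// Runs entirely on the worker thread: allocate, copy in, launch, copy out.
void gpuWorker(dataRow* data_h, dataRow* result_h,
               size_t tableSize, size_t resultSize)
{
    dataRow *data_d = NULL, *result_d = NULL;
    cudaMalloc((void**)&data_d, tableSize);
    cudaMalloc((void**)&result_d, resultSize);

    cudaMemcpy(data_d, data_h, tableSize, cudaMemcpyHostToDevice);
    kernelCall(data_d, result_d, NULL);   // the wrapper from kernels.cu
    cudaMemcpy(result_h, result_d, resultSize, cudaMemcpyDeviceToHost);

    cudaFree(data_d);
    cudaFree(result_d);
}

void runOnWorkerThread(dataRow* data_h, dataRow* result_h,
                       size_t tableSize, size_t resultSize)
{
    // The parent thread never touches the GPU; after join() it reads
    // result_h, which is plain CPU memory filled by the worker.
    boost::thread gpuThread(gpuWorker, data_h, result_h, tableSize, resultSize);
    gpuThread.join();
}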
If you want to use threads in this way, you either need to refactor things so that one thread only "talks" to the GPU and communicates with the parent via CPU memory, or use the CUDA context migration API, which allows a context to be moved from one thread to another (via cuCtxPushCurrent and cuCtxPopCurrent). Be aware that context migration isn't free and involves latency, so if you plan to migrate contexts frequently, you might find it more efficient to switch to a different design that preserves context-thread affinity.
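For the migration route, a hedged sketch of the hand-off mechanics only, using the driver-API calls named above (cuCtxPushCurrent / cuCtxPopCurrent); error checking is omitted and the actual allocations, copies and kernel launch are only indicated by comments:

// Hypothetical sketch: one driver-API context is passed between threads.
#include <boost/thread.hpp>
#include <cuda.h>

CUcontext ctx;   // the single context used, serially, by both threads

void kernelThread()
{
    cuCtxPushCurrent(ctx);     // attach the context to this thread
    // ... launch the kernel here; it runs in the same context the main
    //     thread used for its allocations ...
    cuCtxPopCurrent(&ctx);     // detach it so another thread can take it
}

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev); // context is now current on the main thread

    // ... allocate device memory and copy the input data here ...

    cuCtxPopCurrent(&ctx);     // float the context so the worker can push it

    boost::thread t(kernelThread);
    t.join();

    cuCtxPushCurrent(ctx);     // take the context back on the main thread
    // ... copy the results back and print them here ...
    cuCtxDestroy(ctx);
    return 0;
}

Every pop/push pair serializes access to the context between the two threads, which is the latency cost mentioned above.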