Writing a CUDA kernel to replace an equivalent CPU-only function
I have some .cpp files which implement Smoothed Particle Hydrodynamics (SPH), a particle method for modelling fluid flow. One of the most time-consuming components in these particle techniques is finding the nearest neighbours (K-nearest neighbours or range searching) for every particle at every time-step of the simulation.

Right now I just want to accelerate the neighbour search routine using the GPU and CUDA, replacing my current CPU-based neighbour search routine. Only the neighbour search will run on the GPU, while the rest of the simulation proceeds on the CPU.

My question is: how should I go about compiling the entire code? To be more specific, suppose I write the neighbour-search kernel function in a file nsearch.cu. Should I then rename all my previous .cpp files as .cu files and recompile the whole set (along with nsearch.cu) using nvcc? For simple examples at least, nvcc cannot compile CUDA code with the extension .cpp, i.e. nvcc foo.cu compiles but nvcc hello.cpp doesn't.

In short, what should the structure of this CUDA plugin be, and how should I go about compiling it?

I am using Ubuntu Linux 10.10, CUDA 4.0, an NVIDIA GTX 570 (compute capability 2.0), and the gcc compiler for my work.
Comments (2)
You need to write the nsearch.cu file and compile it with "nvcc -c nsearch.cu -o nsearch.o", and then link nsearch.o with the main application. There has to be an nsearch.h file that exports a wrapper around the actual kernel, so that the .cpp files never see any CUDA syntax.
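A minimal sketch of that layout, assuming a hypothetical wrapper signature and a deliberately naive brute-force search (the file names follow the answer; everything else is illustrative, not from the original):

```cuda
// nsearch.h -- plain C++ header included by the .cpp files.
// It exposes an ordinary function, no CUDA syntax, so g++ can compile
// the rest of the simulation unchanged.
#ifndef NSEARCH_H
#define NSEARCH_H
// Fills a fixed-width neighbour list for n particles (hypothetical API).
void find_neighbours(const float *pos_host, int n, float radius,
                     int *neighbours_host, int max_nb);
#endif

// nsearch.cu -- compiled by nvcc; contains the kernel and the wrapper.
#include <cuda_runtime.h>
#include "nsearch.h"

__global__ void nsearch_kernel(const float *pos, int n, float r2,
                               int *neighbours, int max_nb)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int count = 0;
    // Brute-force O(n^2) range search, just to show the structure;
    // a real SPH code would use a uniform grid or cell list here.
    for (int j = 0; j < n && count < max_nb; ++j) {
        if (j == i) continue;
        float dx = pos[3*i]   - pos[3*j];
        float dy = pos[3*i+1] - pos[3*j+1];
        float dz = pos[3*i+2] - pos[3*j+2];
        if (dx*dx + dy*dy + dz*dz < r2)
            neighbours[i*max_nb + count++] = j;
    }
    if (count < max_nb) neighbours[i*max_nb + count] = -1; // end marker
}

void find_neighbours(const float *pos_host, int n, float radius,
                     int *neighbours_host, int max_nb)
{
    float *pos_dev; int *nb_dev;
    cudaMalloc(&pos_dev, 3 * n * sizeof(float));
    cudaMalloc(&nb_dev,  n * max_nb * sizeof(int));
    cudaMemcpy(pos_dev, pos_host, 3 * n * sizeof(float),
               cudaMemcpyHostToDevice);

    int block = 256, grid = (n + block - 1) / block;
    nsearch_kernel<<<grid, block>>>(pos_dev, n, radius * radius,
                                    nb_dev, max_nb);

    cudaMemcpy(neighbours_host, nb_dev, n * max_nb * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(pos_dev);
    cudaFree(nb_dev);
}
```

With this split, the build is: nvcc -c nsearch.cu -o nsearch.o, g++ -c main.cpp -o main.o, then g++ main.o nsearch.o -o sph -L/usr/local/cuda/lib64 -lcudart. None of the existing .cpp files need to be renamed.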
This is a broader response to your question, since I have been through a very similar thought process to yours - moving my hydrodynamic code onto the GPU whilst leaving everything else on the CPU. Although I think that's where you should start, I also think you should start planning to move all of your other code onto the GPU as well. What I found is that whilst the GPU was very good at the matrix decomposition required by my simulation, the memory boundary between GPU and CPU memory was so slow that something like 80-90% of the GPU simulation time was being spent in cudaMemcpyDeviceToHost / cudaMemcpyHostToDevice.
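To make the point concrete, here is a sketch (not the answerer's code; nsearch_kernel, step_kernel, and cpu_sph_update are hypothetical placeholders) contrasting the hybrid scheme, which pays two PCIe transfers every timestep, with a fully device-resident scheme that copies data back only for occasional output:

```cuda
// Pattern A: only the neighbour search on the GPU.
// Two cudaMemcpy calls per timestep -- this is where the 80-90% goes.
for (int step = 0; step < n_steps; ++step) {
    cudaMemcpy(pos_dev, pos_host, pos_bytes, cudaMemcpyHostToDevice);
    nsearch_kernel<<<grid, block>>>(pos_dev, n, r2, nb_dev, max_nb);
    cudaMemcpy(nb_host, nb_dev, nb_bytes, cudaMemcpyDeviceToHost);
    cpu_sph_update(pos_host, nb_host);   // rest of the SPH step on the CPU
}

// Pattern B: the whole timestep on the GPU.
// Particle data stays device-resident; the host copies results out
// only every output_every steps, so the PCIe cost is amortised away.
cudaMemcpy(pos_dev, pos_host, pos_bytes, cudaMemcpyHostToDevice); // once
for (int step = 0; step < n_steps; ++step) {
    nsearch_kernel<<<grid, block>>>(pos_dev, n, r2, nb_dev, max_nb);
    step_kernel<<<grid, block>>>(pos_dev, nb_dev, n);  // hypothetical
    if (step % output_every == 0)
        cudaMemcpy(pos_host, pos_dev, pos_bytes, cudaMemcpyDeviceToHost);
}
```

Starting with Pattern A is still a reasonable first step; the sketch just shows why profiling the transfers early tells you when it is time to port the rest.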