Building an automatic parallel-computing library with CUDA

Published 2024-10-12 19:22:58 · 927 words · 2 views · 0 comments


For my final year project, I have chosen to build a library that developers could use to do GPGPU computing with CUDA without having to understand the mechanisms behind the different kernel implementations of the CUDA API (in other words, a CUDA wrapper). This library would likely resemble the OpenMP library. For those who are unfamiliar with OpenMP, it is an API that supports multi-platform shared-memory multiprocessing in C, where data layout and decomposition are handled automatically by directives. For example, the API parallelizes code in blocks:

 long sum = 0;
 /* fork off the threads and start the work-sharing construct */
 #pragma omp parallel
 {
   long loc_sum = 0;                /* each thread's private partial sum */
   #pragma omp for schedule(static, 1)
   for (long i = 0; i < N; i++)
   {
     long w = i * i;
     loc_sum = loc_sum + w * a[i];
   }
   #pragma omp critical
   sum = sum + loc_sum;             /* combine the per-thread partial sums */
 }
 printf("\n%li", sum);

In my case, I would like to implement the same functionality for CUDA parallel computing on the GPU. Hence, I will need to build a set of compiler directives, library routines, and environment variables that influence run-time behavior. Every call in CUDA must be hidden from the programmer.
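To make the goal concrete, here is a hand-written sketch of the CUDA code such a directive might expand the loop above into. The kernel name, the launch configuration, and the use of a 64-bit `atomicAdd` for the reduction are all my own placeholders, not an existing implementation:

```cuda
#include <cstdio>

#define N 1024

/* Kernel the wrapper might generate from the loop body: each thread
   handles one iteration and accumulates into a global sum atomically.
   (The 64-bit atomicAdd overload operates on unsigned long long.) */
__global__ void weighted_sum(const long long *a, unsigned long long *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        long long w = (long long)i * i;
        atomicAdd(sum, (unsigned long long)(w * a[i]));
    }
}

int main(void)
{
    long long h_a[N];
    for (int i = 0; i < N; i++) h_a[i] = 1;

    /* The library's runtime would emit these allocations and
       transfers behind the directive, invisibly to the programmer. */
    long long *d_a;
    unsigned long long *d_sum, h_sum = 0;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_sum, sizeof(*d_sum));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sum, &h_sum, sizeof(h_sum), cudaMemcpyHostToDevice);

    weighted_sum<<<(N + 255) / 256, 256>>>(d_a, d_sum);

    cudaMemcpy(&h_sum, d_sum, sizeof(h_sum), cudaMemcpyDeviceToHost);
    printf("\n%llu", h_sum);

    cudaFree(d_a);
    cudaFree(d_sum);
    return 0;
}
```

Everything between the kernel launch and the final `printf` is exactly the bookkeeping the library would need to generate automatically.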

Since CUDA is a SIMD architecture, I know there are many factors that have to be accounted for, especially dependencies between iterations. But for now I assume that the programmer knows the limitations of GPGPU computing.

Now, here is where I need your help. Could anyone give me advice on where to start building such a library? Also, does anyone have any good tutorials that could help me deal with compiler directives or environment variables? Or does anyone know of another library that does a similar task and for which I could find good documentation?

And most importantly, do you think this is a project that can be done in about 1200 hours? I am already a bit familiar with GPGPU and CUDA, but building such a library is new to me.


Comments (2)

不打扰别人 2024-10-19 19:22:58


This isn't so much writing a library as rewriting part of the compiler. Neither GCC nor Visual Studio lets you define your own pragmas, for one thing, and you'd need to play nicely with the built-in optimizer.

Honestly, it seems to me that the actual GPGPU part of this is the easy part.

If you want to see how they did OpenMP in GCC, I suggest looking at the GOMP project history.

青春如此纠结 2024-10-19 19:22:58


This is a bit subjective, but it sounds like a very challenging project. It takes a fair amount of thought and planning to structure a problem well enough that the data transfer from host to GPU pays off, and it only makes sense for a subset of problems.

As far as existing projects that do something similar, there are simple wrappers like PyCUDA and PyOpenCL that wrap small bits of GPU functionality such as matrix math. The one that is perhaps closest is Theano, which is focused on fairly mathematical computations but does a good job of abstracting away the GPU component.
