Building an automatic parallel-computing library with CUDA

Published 2024-10-12 19:22:58 · 927 words · 2 views · 0 comments


For my final year project, I have chosen to build a library that developers could use to do GPGPU computing with CUDA without having to understand the mechanisms behind the different kernel implementations of the CUDA API (in other words, a CUDA wrapper). This library would likely resemble the OpenMP library. For those who are unfamiliar with OpenMP, it is an API that supports multi-platform shared-memory multiprocessing in C, where data layout and decomposition are handled automatically by directives. For example, the API parallelizes code in blocks:

 long sum = 0;
 /* fork off the threads and start the work-sharing construct */
 #pragma omp parallel
 {
   long loc_sum = 0;                /* each thread's private partial sum */
   #pragma omp for schedule(static, 1)
   for (long i = 0; i < N; i++)
   {
     long w = i * i;
     loc_sum = loc_sum + w * a[i];
   }
   #pragma omp critical
   sum = sum + loc_sum;             /* combine the per-thread partial sums */
 }
 printf("\n%li", sum);

In my case, I would like to implement the same functionality for CUDA parallel computing on the GPU. Hence, I will need to build a set of compiler directives, library routines, and environment variables that influence run-time behavior. Every call in CUDA must be hidden from the programmer.
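To make the goal concrete, here is a hand-written sketch of the CUDA code such a directive might expand the loop above into. The kernel name, the launch configuration, and the use of a 64-bit `atomicAdd` for the reduction are all my own placeholders, not an existing implementation:

```cuda
#include <cstdio>

#define N 1024

/* Kernel the wrapper might generate from the loop body: each thread
   handles one iteration and accumulates into a global sum atomically.
   (The 64-bit atomicAdd overload operates on unsigned long long.) */
__global__ void weighted_sum(const long long *a, unsigned long long *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        long long w = (long long)i * i;
        atomicAdd(sum, (unsigned long long)(w * a[i]));
    }
}

int main(void)
{
    long long h_a[N];
    for (int i = 0; i < N; i++) h_a[i] = 1;

    /* The library's runtime would emit these allocations and
       transfers behind the directive, invisibly to the programmer. */
    long long *d_a;
    unsigned long long *d_sum, h_sum = 0;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_sum, sizeof(*d_sum));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sum, &h_sum, sizeof(h_sum), cudaMemcpyHostToDevice);

    weighted_sum<<<(N + 255) / 256, 256>>>(d_a, d_sum);

    cudaMemcpy(&h_sum, d_sum, sizeof(h_sum), cudaMemcpyDeviceToHost);
    printf("\n%llu", h_sum);

    cudaFree(d_a);
    cudaFree(d_sum);
    return 0;
}
```

Everything between the kernel launch and the final `printf` is exactly the bookkeeping the library would need to generate automatically.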

Since CUDA is a SIMD architecture, I know there are many factors that have to be accounted for, especially dependencies between iterations. But for now I assume that the programmer knows the limitations of GPGPU computing.

Now, here is where I need your help. Could anyone give me advice on where to start building such a library? Also, does anyone have any good tutorials that could help me deal with compiler directives or environment variables? Or does anyone know of another library that does a similar task and for which I could find good documentation?

And most importantly, do you think this is a project that can be done in about 1200 hours? I am already a bit familiar with GPGPU and CUDA, but building such a library is new to me.


Comments (2)

不打扰别人 2024-10-19 19:22:58


This isn't so much writing a library as rewriting part of the compiler. Neither GCC nor Visual Studio lets you define your own pragmas, for one thing, and you'd need to play nicely with the built-in optimizer.

Honestly, it seems to me that the actual GPGPU part of this is the easy part.

If you want to see how they did OpenMP in GCC, I suggest looking at the GOMP project history.

青春如此纠结 2024-10-19 19:22:58


This is a bit subjective, but it sounds like a very challenging project. It takes a fair amount of thought and planning to structure a problem well enough that the data transfer from host to GPU pays off, and it only makes sense for a subset of problems.

As far as existing projects that do something similar, there are simple wrappers like PyCUDA and PyOpenCL that wrap small bits of GPU functionality such as matrix math. The one that is perhaps closest is Theano, which is focused on fairly mathematical computations but does a good job of abstracting away the GPU component.
