What are example problems and solutions for learning shared-memory, distributed-memory, and/or GPU programming?
We are looking for exemplar problems and codes that will run on any or all of shared-memory, distributed-memory, and GPGPU architectures. The reference platform we are using is LittleFe (littlefe.net), an open-design, low-cost educational cluster that currently has six dual-core CPUs, each with an nVidia chipset.
These problems and solutions will be good for teaching parallelism to any newbie by providing working examples and opportunities to roll up your sleeves and code. Stack Overflow experts have good insight and are likely to have some favorites.
Calculating the area under a curve is interesting, simple, and easy to understand, but there are bound to be other problems that are just as easily expressed and chock-full of opportunities to practice and learn.
Hybrid examples using more than one of the memory architectures are most desirable, and reflective of where parallel programming seems to be trending.
On LittleFe we have predominantly been using three applications. The first is an analysis of optimal targets on a dartboard, which is highly parallel with little communication overhead. The second is Conway's Game of Life, which is typical of problems that share boundary conditions; it has moderate communication overhead. The third is an n-body model of galaxy formation, which requires heavy communication overhead.
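For concreteness, here is a minimal CUDA sketch in the spirit of the first application: an embarrassingly parallel Monte Carlo "dart throwing" kernel. It estimates pi by counting random darts that land inside a quarter circle, a standard stand-in for this kind of exercise rather than the optimal-target analysis itself; the seed and launch parameters are arbitrary choices.

    #include <cstdio>
    #include <curand_kernel.h>

    // Each thread throws `throwsPerThread` random darts at the unit square
    // and counts those landing inside the quarter circle x^2 + y^2 <= 1.
    __global__ void throwDarts(unsigned long long seed, int throwsPerThread,
                               unsigned long long *hits)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, tid, 0, &state);   // independent stream per thread

        unsigned long long local = 0;
        for (int i = 0; i < throwsPerThread; ++i) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f)
                ++local;
        }
        atomicAdd(hits, local);              // single reduction per thread
    }

    int main()
    {
        const int blocks = 64, threads = 256, perThread = 10000;
        unsigned long long *dHits, hHits = 0;
        cudaMalloc(&dHits, sizeof(*dHits));
        cudaMemcpy(dHits, &hHits, sizeof(hHits), cudaMemcpyHostToDevice);

        throwDarts<<<blocks, threads>>>(1234ULL, perThread, dHits);
        cudaMemcpy(&hHits, dHits, sizeof(hHits), cudaMemcpyDeviceToHost);
        cudaFree(dHits);

        double total = (double)blocks * threads * perThread;
        printf("pi ~= %f\n", 4.0 * hHits / total);
        return 0;
    }

The same structure maps directly onto MPI (each rank throws its own darts, then a single MPI_Reduce gathers the hit counts), which is part of what makes this class of problem a good first exercise on a hybrid cluster like LittleFe.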
The CUDA Programming Guide contains a detailed analysis of the implementation of matrix multiplication on a GPU. That seems to be the staple "hello world" example for learning GPU programming.
Furthermore, the CUDA SDK contains dozens of other well-explained examples of GPU programming in CUDA and OpenCL. My favorite is the colliding balls example (a demo with a few thousand balls colliding in real time).
Update:
The CUDA samples are not packaged with the Toolkit anymore. Instead, you can find them on GitHub.
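For readers who don't want to dig through the guide straight away, here is a condensed sketch of that staple example: a tiled matrix-multiplication kernel in the style the Programming Guide analyzes. The tile size, the square shapes, and the restriction that N be a multiple of TILE are simplifications for brevity.

    #include <cstdio>
    #include <cstdlib>

    #define TILE 16

    // C = A * B for N x N row-major matrices. Each block computes one
    // TILE x TILE tile of C, staging tiles of A and B through on-chip
    // shared memory to cut global-memory traffic.
    __global__ void matMul(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();                  // wait until both tiles are loaded

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                  // done with this pair of tiles
        }
        C[row * N + col] = acc;
    }

    int main()
    {
        const int N = 512;                    // must be a multiple of TILE
        size_t bytes = (size_t)N * N * sizeof(float);
        float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes),
              *hC = (float*)malloc(bytes);
        for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
        matMul<<<grid, block>>>(dA, dB, dC, N);
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

        printf("C[0] = %.0f (expect %.0f)\n", hC[0], 2.0f * N);
        return 0;
    }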
Two of my favorites are numerical integration and finding prime numbers. For the first, we code the midpoint rectangle rule on the function f(x) = 4.0 / (1.0 + x*x). Integrating that function between 0 and 1 gives an approximation of the constant pi, which makes checking the correctness of the answer easy. The parallelism is across the range of the integration (computing the areas of the rectangles).
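As a sketch of that first exercise, here is the midpoint rule in CUDA (the answer does not name an API, so OpenMP or MPI versions would distribute the same loop in the same way). Each thread sums a strided subset of the rectangles and writes out a partial answer for the host to gather; the launch dimensions and step count are arbitrary.

    #include <cstdio>
    #include <cstdlib>

    // Midpoint rectangle rule for f(x) = 4.0 / (1.0 + x*x) on [0,1];
    // the exact integral is pi, so the answer is easy to check.
    __global__ void midpointPi(int steps, double *partial)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        double h = 1.0 / steps, local = 0.0;

        for (int i = tid; i < steps; i += nthreads) {
            double x = (i + 0.5) * h;         // midpoint of rectangle i
            local += 4.0 / (1.0 + x * x);
        }
        partial[tid] = local * h;             // this thread's slice of the area
    }

    int main()
    {
        const int blocks = 64, threads = 128, steps = 1 << 24;
        const int n = blocks * threads;
        double *hPartial = (double*)malloc(n * sizeof(double));
        double *dPartial;
        cudaMalloc(&dPartial, n * sizeof(double));

        midpointPi<<<blocks, threads>>>(steps, dPartial);
        cudaMemcpy(hPartial, dPartial, n * sizeof(double),
                   cudaMemcpyDeviceToHost);

        double pi = 0.0;                      // gather the partial answers
        for (int i = 0; i < n; ++i) pi += hPartial[i];
        printf("pi ~= %.12f\n", pi);

        free(hPartial); cudaFree(dPartial);
        return 0;
    }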
For the second, we input an integer range and then identify and save the prime numbers in that range. We use brute-force division of each value by all possible factors; if any divisor other than 1 or the number itself is found, the value is composite. If a prime is found, we count it and store it in a shared array. The parallelism is in dividing up the range, since testing N for primality is independent of testing M. There is some trickiness needed to share the prime store between threads or to gather distributed partial answers.
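A minimal CUDA version of the prime exercise might look like the following. It only counts primes rather than storing them; storing them would additionally need an atomically incremented index into an output array, which is exactly the trickiness the answer mentions. The range here is an arbitrary choice.

    #include <cstdio>

    // Thread i tests the value lo + i by brute-force trial division.
    // The shared tally is updated atomically so no counts are lost.
    __global__ void countPrimes(int lo, int hi, int *count)
    {
        int n = lo + blockIdx.x * blockDim.x + threadIdx.x;
        if (n > hi || n < 2) return;

        bool prime = true;
        for (int d = 2; d * d <= n; ++d)      // try every possible factor
            if (n % d == 0) { prime = false; break; }

        if (prime)
            atomicAdd(count, 1);              // safe shared update
    }

    int main()
    {
        const int lo = 2, hi = 1000000;
        int *dCount, hCount = 0;
        cudaMalloc(&dCount, sizeof(int));
        cudaMemcpy(dCount, &hCount, sizeof(int), cudaMemcpyHostToDevice);

        int total = hi - lo + 1, threads = 256;
        int blocks = (total + threads - 1) / threads;
        countPrimes<<<blocks, threads>>>(lo, hi, dCount);

        cudaMemcpy(&hCount, dCount, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d primes in [%d, %d]\n", hCount, lo, hi);
        return 0;
    }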
These are very basic, simple problems to solve, which lets students focus on the parallel implementation rather than on the computation involved.
One of the more complex but still easy example problems is the BLAS routine sgemm or dgemm (C = alpha * A x B + beta * C), where A, B, and C are matrices of valid sizes and alpha and beta are scalars. The type may be single-precision floating point (sgemm) or double-precision floating point (dgemm).
Implementing this simple routine on different platforms and architectures teaches some insights into their functionality and working principles. For more details on BLAS and the ?gemm routines, have a look at http://www.netlib.org/blas.
Just be aware that a double-precision implementation on the GPU requires a GPU with double-precision capabilities.
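A naive CUDA kernel for the double-precision case might look like the sketch below, with one thread per element of C, launched with a 2-D grid as in the matrix-multiplication example earlier in this thread. Tiling, rectangular shapes, and BLAS leading-dimension arguments are omitted for brevity; swapping double for float gives the sgemm variant.

    // C = alpha * A x B + beta * C for N x N row-major matrices.
    // Double precision requires a GPU of compute capability 1.3 or
    // higher, e.g. compile with: nvcc -arch=sm_13 dgemm.cu
    __global__ void dgemm(int N, double alpha, const double *A,
                          const double *B, double beta, double *C)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= N || col >= N) return;

        double acc = 0.0;
        for (int k = 0; k < N; ++k)           // dot of a row of A and a column of B
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }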