Simplest possible example to show the GPU outperforming the CPU using CUDA
I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc) for which the GPU outperforms the CPU. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.
4 Answers
First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU on a nanosecond job (or even a millisecond or second job) completely misses the point of using the GPU. Below is some simple code, but to really appreciate the performance benefits of the GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two-foot race, simply because it takes some time to turn the key, start the engine, and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
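(The original C++ listing did not survive extraction. The sketch below is a minimal reconstruction of the kind of workload the answer describes: N independent elements, each updated M times, so arithmetic dominates. N, M, and the x*x - 0.25 update are illustrative placeholders, not values taken from the original answer.)

```cpp
// Minimal CPU sketch (assumed workload): N independent elements,
// each updated M times, so arithmetic dominates over memory traffic.
#include <cstdio>
#include <vector>

#define N (1024 * 1024)   // problem size (assumed)
#define M 1000            // iterations per element (assumed)

int main() {
    std::vector<float> data(N);
    for (int i = 0; i < N; i++) {
        float x = 1.0f * i / N;
        for (int j = 0; j < M; j++)
            x = x * x - 0.25f;          // cheap, branch-free update
        data[i] = x;
    }
    printf("data[0] = %f, data[N-1] = %f\n", data[0], data[N - 1]);
    return 0;
}
```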
Use something like this in CUDA/C:
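(Again, the original CUDA listing is missing; this is a minimal sketch of the same assumed workload with one thread per element and 256 threads per block, matching the "256" mentioned below. The kernel name and constants are placeholders.)

```cuda
// Minimal CUDA sketch of the same (assumed) workload: one thread per element.
#include <cstdio>
#include <cuda_runtime.h>

#define N (1024 * 1024)   // problem size (assumed)
#define M 1000            // iterations per element (assumed)

__global__ void kernel(float *buf) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float x = 1.0f * i / N;
    for (int j = 0; j < M; j++)
        x = x * x - 0.25f;              // same update as the CPU version
    buf[i] = x;
}

int main() {
    static float h_buf[N];
    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    kernel<<<(N + 255) / 256, 256>>>(d_buf);
    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    printf("buf[0] = %f, buf[N-1] = %f\n", h_buf[0], h_buf[N - 1]);
    return 0;
}
```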
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
For reference, I made a similar example with time measurements. With a GTX 660, the GPU speedup was 24x, where the measured GPU time includes the data transfers in addition to the actual computation.
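(The answer's timing code isn't shown. Below is a self-contained sketch of how such a measurement might be taken with CUDA events so that the host-device transfers are counted along with the kernel; the `square` kernel, N, and the 256-thread block size are placeholders, not the answer's actual code.)

```cuda
// Sketch (assumed setup): timing a GPU run with CUDA events, where the
// measured interval covers the host<->device transfers as well as the kernel.
#include <cstdio>
#include <cuda_runtime.h>

#define N (1024 * 1024)   // placeholder problem size

__global__ void square(float *buf) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) buf[i] = buf[i] * buf[i];
}

int main() {
    static float h_buf[N];
    for (int i = 0; i < N; i++) h_buf[i] = 1.0f * i / N;

    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);   // upload
    square<<<(N + 255) / 256, 256>>>(d_buf);                               // compute
    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);   // download
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time including transfers: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return 0;
}
```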
A very, very simple method would be to calculate the squares of, say, the first 100,000 integers, or a large matrix operation. It's easy to implement and plays to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs. C++ a while back and got some pretty astonishing results. (A 2GB GTX 460 achieved about 40x the performance of a dual-core CPU.)
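(The OpenCL code from that experiment isn't included here. Purely to illustrate the "squares of the first 100,000 integers" idea in the question's own CUDA setting, a minimal sketch might look like the following; COUNT, the kernel name, and the launch configuration are made up for the example.)

```cuda
// Sketch: squaring the first 100,000 integers on the GPU, one thread each.
// (Illustrative CUDA version of the idea; the answer itself used OpenCL.)
#include <cstdio>
#include <cuda_runtime.h>

#define COUNT 100000

__global__ void squares(long long *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < COUNT) {
        long long n = i + 1;            // integers 1..COUNT
        out[i] = n * n;                 // no branching beyond the bounds check
    }
}

int main() {
    static long long h_out[COUNT];
    long long *d_out;
    cudaMalloc((void **)&d_out, COUNT * sizeof(long long));
    squares<<<(COUNT + 255) / 256, 256>>>(d_out);
    cudaMemcpy(h_out, d_out, COUNT * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("100000^2 = %lld\n", h_out[COUNT - 1]);
    return 0;
}
```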
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)
I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.
For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the "int gpu = 1;" line in the hello.c source file to 0 for the CPU or 1 for the GPU.
Apple has some more OpenCL example code in their main Mac source code listing.
Dr. David Gohara showed an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (around minute 34). In his calculation, he saw a roughly 27x speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.
I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color-thresholding shader run across an image on the GPU was roughly 14-28x faster than the same calculation performed on the CPU of that particular device.