CPU vs. GPU - when the CPU is better
I know many examples where a GPU is much faster than a CPU, but there are algorithms (problems) that are very hard to parallelise. Could you give me some examples or tests where a CPU can beat a GPU?
Edit:
Thanks for the suggestions! We can make the comparison between the most popular and newest CPUs and GPUs, for example a Core i5 2500K vs. a GeForce GTX 560 Ti.
I wonder how to compare the SIMD models between them. For example: CUDA calls its SIMD model, more precisely, SIMT. But SIMT should be compared with multithreading on CPUs, which distributes threads (tasks) between MIMD cores (the Core i5 2500K gives us 4 MIMD cores). On the other hand, each of these MIMD cores can implement a SIMD model, but this is something other than SIMT and I don't know how to compare them. Finally, a Fermi architecture with concurrent kernel execution might be considered MIMD cores with SIMT.
1 Answer
Based on my experience, I will summarize the key performance differences between parallel programs on CPUs and GPUs. Trust me, the comparison can change from generation to generation, so I will just point out what is good and what is bad for CPUs and GPUs. Of course, if you write a program at one extreme, i.e. with only the bad sides or only the good sides, it will definitely run faster on one platform. But a mixture of the two requires very complicated reasoning.
Host program level
One key difference is the memory transfer cost. GPU devices require some memory transfers, and in some cases this cost is non-trivial, for example when you have to transfer some big arrays frequently. In my experience, this cost can be minimized by pushing most of the host code into device code. The only cases where you cannot do so are when you have to interact with the host operating system, such as outputting to the monitor.
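As a rough illustration of the transfer cost, here is a minimal sketch (the kernel, array size, and timing scaffolding are all made up for illustration): the kernel does one cheap operation per element, so the two cudaMemcpy calls dominate the measured time, whereas a plain CPU loop over the same array would avoid that cost entirely.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one cheap operation per element, so the
// transfers below dominate the total run time.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 24;                  // ~16M floats, ~64 MB
    std::vector<float> host(n, 1.0f);

    float *dev = nullptr;
    cudaMalloc((void **)&dev, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy in + kernel + copy out: %.2f ms\n", ms);

    cudaFree(dev);
    return 0;
}
```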
Device program level
Now we come to a complex picture that hasn't been fully revealed yet. What I mean is that there are many mysterious aspects of GPUs that haven't been disclosed. Still, a lot distinguishes CPU and GPU (kernel) code in terms of performance.
There are a few factors that I have noticed contribute dramatically to the difference.
GPUs, which consist of many execution units, are designed to handle massively parallel programs. If you have little work, say a few sequential tasks, and put those tasks on a GPU, only a few of those many execution units will be busy, so it will be slower than a CPU. CPUs, on the other hand, are better at handling short, sequential tasks. The reason is simple: CPUs are much more complicated and able to exploit instruction-level parallelism, whereas GPUs exploit thread-level parallelism. Well, I heard the NVIDIA GF104 can do superscalar execution, but I have had no chance to experiment with it.
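As a sketch of the sequential case (the recurrence is made up, purely for illustration): a loop in which every step depends on the previous one can only occupy a single GPU thread, while one CPU core, with its higher clock speed and instruction-level parallelism, runs the same loop faster.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A purely sequential recurrence: each step depends on the previous one,
// so there is nothing to spread across the GPU's many execution units.
__global__ void serial_recurrence(float *out, float x0, int steps) {
    float x = x0;
    for (int i = 0; i < steps; ++i)
        x = 0.5f * x + 1.0f;        // made-up recurrence, illustration only
    *out = x;
}

int main() {
    const int steps = 1 << 20;

    // On the GPU this has to run as a single thread in a single block...
    float *d_out = nullptr;
    cudaMalloc((void **)&d_out, sizeof(float));
    serial_recurrence<<<1, 1>>>(d_out, 0.0f, steps);
    float gpu_result = 0.0f;
    cudaMemcpy(&gpu_result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // ...while one CPU core simply runs the same loop directly.
    float x = 0.0f;
    for (int i = 0; i < steps; ++i)
        x = 0.5f * x + 1.0f;

    printf("gpu: %f  cpu: %f\n", gpu_result, x);
    return 0;
}
```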
It is worth noting that in GPUs the workload is divided into small blocks (or work-groups in OpenCL), and the blocks are scheduled onto the streaming multiprocessors, each block executing on one of them (I am using NVIDIA's terminology). In CPUs, however, those blocks are executed sequentially - I can't think of anything other than a single loop.
Thus, programs that have a small number of blocks are likely to run faster on CPUs.
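For example, here is a compilable fragment (the kernel and launch sizes are assumptions for illustration) showing two launch configurations for the same element-wise kernel; with only two blocks, most of the GPU's multiprocessors sit idle, and a CPU loop over the data may well win.

```cuda
// Grid-stride kernel: correct for any number of blocks.
__global__ void saxpy(float *y, const float *x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

void launch_examples(float *y, const float *x, float a, int n) {
    // Only 2 blocks: at most 2 streaming multiprocessors have any work,
    // the rest of the GPU is idle.
    saxpy<<<2, 256>>>(y, x, a, n);

    // Roughly one thread per element: enough blocks to occupy all the
    // multiprocessors and help hide memory latency.
    saxpy<<<(n + 255) / 256, 256>>>(y, x, a, n);
}
```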
Branches are always bad for GPUs. Please bear in mind that GPUs prefer things to be equal: equal blocks, equal threads within a block, and equal threads within a warp. But what matters the most?
CUDA/OpenCL programmers hate branch divergence. Since all threads are grouped into sets of 32 threads, called warps, and all threads within a warp execute in lockstep, a branch divergence causes some threads in the warp to be serialized, and the execution time of the warp is multiplied accordingly.
Unlike GPUs, each core in a CPU can follow its own path. Furthermore, branches can be executed efficiently because CPUs have branch prediction.
Thus, programs with more warp divergence are likely to run faster on CPUs.
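A minimal sketch of what this looks like in a kernel (made-up kernels, assuming the usual warp size of 32): the first version diverges within every warp, the second keeps each warp on a single path.

```cuda
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Odd and even lanes of the same warp take different paths, so the
    // warp executes both paths one after the other, with half of the
    // lanes masked off each time.
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}

__global__ void uniform_branch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The condition is identical for all 32 threads of a warp (threads
    // 0..31 form warp 0, 32..63 form warp 1, ...), so every warp takes
    // exactly one path and nothing is serialized.
    if ((i / 32) % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}
```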
This REALLY is complicated, so let's keep it brief.
Remember that global memory accesses have very high latency (400-800 cycles). So in older generations of GPUs, whether memory accesses were coalesced was a critical matter. Now your GTX 560 Ti (Fermi) has two levels of cache, so the global memory access cost can be reduced in many cases. However, the caches in CPUs and GPUs are different, so their effects are also different.
What I can say is that it really depends on your memory access pattern and your kernel code pattern (how memory accesses are interleaved with computation, the types of operations, etc.) whether something runs faster on the GPU or the CPU.
But somehow you can expect a huge number of cache misses (in GPUs) to have a very bad effect on GPUs (how bad? - that depends on your code).
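To illustrate the coalescing point, here is a sketch of two copy kernels (the names and the stride parameter are assumptions for illustration): in the first, consecutive threads touch consecutive addresses; in the second, they do not.

```cuda
__global__ void copy_coalesced(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive addresses, so a warp's 32
    // loads can be served by a small number of memory transactions.
    if (i < n) dst[i] = src[i];
}

__global__ void copy_strided(float *dst, const float *src, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read addresses that are far apart, so each load
    // may need its own transaction and much more of the 400-800 cycle
    // latency is exposed.
    if (i * stride < n) dst[i] = src[i * stride];
}
```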
Additionally, shared memory is an important feature of GPUs. Accessing shared memory is as fast as accessing the GPU's L1 cache, so kernels that make use of shared memory benefit considerably.
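For example, a common pattern (this block-level sum reduction is a generic sketch, not something from the discussion above; it assumes a launch with 256 threads per block): each element is read from slow global memory exactly once, and all further traffic stays in on-chip shared memory.

```cuda
// Block-level sum reduction staged in shared memory.
// Assumes blockDim.x == 256 to match the tile size.
__global__ void block_sum(const float *in, float *block_results, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread copies one element from global memory into fast
    // on-chip shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction entirely within shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum back to global memory.
    if (tid == 0)
        block_results[blockIdx.x] = tile[0];
}
```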
There are some other factors I haven't really mentioned that can have a big impact on performance in many cases, such as bank conflicts, memory transaction size, GPU occupancy...