Does GPGPU programming only allow the execution of SIMD instructions?

Posted on 2024-12-08 20:44:37


Does GPGPU programming only allow the execution of SIMD instructions?
If so, then it must be a tedious task to rewrite an algorithm that has
been designed to run on a general-purpose CPU so that it runs on a GPU.
Also, is there a pattern to the algorithms that can be converted to a
SIMD architecture?


Comments (1)

哆啦不做梦 2024-12-15 20:44:37


Well, it's not quite accurate to say that GPGPU only supports SIMD execution. Many GPUs have some non-SIMD components. But, overall, to take full advantage of a GPU you need to be running SIMD code.

However, you are NOT necessarily writing SIMD instructions. That is, GPU SIMD is not the same as CPU SIMD - not the same as writing code to take advantage of x86 SSE (Streaming SIMD Extensions), etc. Indeed, as one of the people who brought CPU SIMD to you (I was heavily involved in Intel MMX, one of the earliest such instruction sets, and have followed the evolution to FP SIMD), I often feel obliged to correct people who say that CPUs like Intel's have SIMD instructions. I prefer to consider them packed vector instructions, although I grudgingly call them SIMD packed vector instruction sets just because everyone misuses the name. I also emphasize that CPU SIMD instruction sets such as MMX and SSE may have SIMD packed vector execution units - integer and floating point ALUs, etc. - but they don't have SIMD control flow, and they usually don't have SIMD memory access, a.k.a. scatter/gather (although Intel Larrabee was moving in that direction).

Some pages on my comp-arch.net wiki about this (I write about computer architecture for my hobby):
- http://semipublic.comp-arch.net/wiki/SIMD
- http://semipublic.comp-arch.net/wiki/SIMD_packed_vector
- http://semipublic.comp-arch.net/wiki/Difference_between_vector_and_packed_vector
- http://semipublic.comp-arch.net/wiki/Single_Instruction_Multiple_Threads_(SIMT)
although I apologize for not yet having written the page that talks about SIMD packed vector instruction sets, as in Intel MMX or SSE.

But I don't expect you to read all of the above. Let me try to explain.

Imagine that you have a piece of code that looks something like this, when written in a simple, scalar manner:

// operating on an array with one million 32b floating point elements A[1000000]
for i from 0 upto 999999 do
     if some_condition(A[i]) then
           A[i] = function1(A[i])
     else
           A[i] = function2(A[i])

where function1() and function2() are simple enough to inline - say function1(x) = x*x and function2(x) = sqrt(x).
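
For concreteness, here is that scalar loop as compilable C. The predicate is a placeholder of my own choosing (the original only says some_condition); I use x < 1.0f so that sqrt always sees a non-negative argument:

#include <math.h>

#define N 1000000
float A[N];   // one million 32b floating point elements

// assumed placeholder predicate -- stands in for "some_condition"
static inline int some_condition(float x) { return x < 1.0f; }

void scalar_version(void) {
    for (int i = 0; i < N; i++) {
        if (some_condition(A[i]))
            A[i] = A[i] * A[i];   // function1(x) = x*x
        else
            A[i] = sqrtf(A[i]);   // function2(x) = sqrt(x)
    }
}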

On a CPU, to use something like SSE, you would have to (1) divide the array up into chunks, say the width of a 256-bit AVX register, and (2) handle the IF statement yourself, using masks or the like. Something like:

for i from 0 upto 999999 by 8 do
     register tmp256b_1 = load256b(&A[i])
     register tmp256b_2 = tmp256b_1 * tmp256b_1     // function1 on all lanes
     register tmp256b_3 = _mm256_sqrt_ps(tmp256b_1) // this is an "intrinsic":
                                                    // a function, possibly inlined,
                                                    // doing a Newton-Raphson to evaluate sqrt.
     register mask256b = ... code that arranges for you to have all 1s in each 32b "lane"
                         where some_condition is true, and 0s elsewhere...
     register tmp256b_4 = (tmp256b_2 & mask256b) | (tmp256b_3 & ~mask256b)
     store256b(&A[i],tmp256b_4)
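
If it helps to see this with real intrinsics rather than pseudocode, here is a minimal AVX sketch of the same loop, under the same assumed placeholder predicate (x < 1.0f), with n assumed to be a multiple of 8; _mm256_blendv_ps plays the role of the explicit AND/OR masking:

#include <immintrin.h>

void avx_version(float *A, int n) {
    const __m256 one = _mm256_set1_ps(1.0f);
    for (int i = 0; i < n; i += 8) {
        __m256 x    = _mm256_loadu_ps(&A[i]);
        // note that BOTH sides of the IF are always computed; the blend throws half away
        __m256 sq   = _mm256_mul_ps(x, x);                 // function1 on all 8 lanes
        __m256 rt   = _mm256_sqrt_ps(x);                   // function2 on all 8 lanes
        __m256 mask = _mm256_cmp_ps(x, one, _CMP_LT_OQ);   // all 1s in lanes where x < 1.0f
        __m256 out  = _mm256_blendv_ps(rt, sq, mask);      // sq where mask set, rt elsewhere
        _mm256_storeu_ps(&A[i], out);
    }
}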

You may not think this is so bad, but remember, this is a simple example. Imagine multiple nested IFs, and so on. Or, imagine that "some_condition" is clumpy, so that you might save a lot of unnecessary computation by skipping sections where it is all function1 or all function2...

for i from 0 upto 999999 by 8 do
     register tmp256b_1 = load256b(&A[i])
     register mask256b = ... code that arranges for you to have all 1s in each 32b "lane"
                         where some_condition is true, and 0s elsewhere...
     if mask256b == ~0 then              // every lane wants function1
         register tmp256b_2 = tmp256b_1 * tmp256b_1
         store256b(&A[i],tmp256b_2)
     else if mask256b == 0 then          // every lane wants function2
         register tmp256b_3 = _mm256_sqrt_ps(tmp256b_1)
         store256b(&A[i],tmp256b_3)
     else                                // mixed lanes: compute both, blend with the mask
         register tmp256b_2 = tmp256b_1 * tmp256b_1
         register tmp256b_3 = _mm256_sqrt_ps(tmp256b_1)
         register tmp256b_4 = (tmp256b_2 & mask256b) | (tmp256b_3 & ~mask256b)
         store256b(&A[i],tmp256b_4)
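
(The "is the mask all ones / all zeros" tests can be done for real with _mm256_movemask_ps, which packs the sign bit of each of the 8 lanes into an ordinary integer; a small sketch, with helper names of my own invention:)

#include <immintrin.h>

// nonzero iff every lane of the comparison mask is set (all want function1)
static inline int all_lanes_true(__m256 mask)  { return _mm256_movemask_ps(mask) == 0xFF; }

// nonzero iff no lane of the comparison mask is set (all want function2)
static inline int all_lanes_false(__m256 mask) { return _mm256_movemask_ps(mask) == 0; }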

I think you get the picture? And it gets even more complicated when you have multiple arrays, and sometimes the data is aligned on a 256-bit boundary and sometimes not (as is typical, say, in stencil computations, where you operate on all alignments).

Now, here's roughly what it looks like on something like a GPU:

// operating on an array with one million 32b floating point elements A[1000000]
for all i from 0 upto 999999 do
     if some_condition(A) then
           A = function1(A)
     else
           A = function2(A)

Doesn't that look a lot more like the original scalar code? The only real difference is that you have lost the array indexes, A[i]. (Actually, some GPGPU languages keep the array indexes in, but most that I know of do not.)
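
As a concrete illustration, here is roughly what that becomes as a CUDA kernel: each thread handles one element i, and the compiler and hardware take care of the divergent IF (again using my assumed placeholder predicate x < 1.0f):

__global__ void process(float *A, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) {
        float x = A[i];
        if (x < 1.0f)          // some_condition; divergence handled by hardware masks
            A[i] = x * x;      // function1
        else
            A[i] = sqrtf(x);   // function2
    }
}

// launched with something like:
//     process<<<(1000000 + 255) / 256, 256>>>(dev_A, 1000000);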

Now, I have left out (a) OpenCL's C-like syntax, and (b) all of the setup that you need to connect the OpenCL code to your C or C++ code. (There are much better languages than CUDA or OpenCL - these have a lot of cruft - but they are available in many places, on both CPUs and GPUs[**].) But I think I have presented the heart of the matter:

The key thing about GPGPU computation is that you write SIMD, data-parallel code. But you write it at a higher level than you write CPU-style SSE code. Higher level even than the compiler intrinsics.

First, the GPGPU compiler, e.g. the OpenCL or CUDA compiler, handles a lot of the data management behind your back. The compiler arranges to do the control flow, the IF statements, etc.

By the way, note, as I marked with [**], that sometimes a so-called SIMD GPGPU compiler can generate code that will run on both CPUs and GPUs. I.e. a SIMD compiler can generate code that uses CPU SIMD instruction sets.

But GPUs themselves have special hardware support that runs this SIMD code, appropriately compiled, much faster than it can run on a CPU using CPU SIMD instructions. Most importantly, GPUs have many more execution units - e.g. a CPU like AMD Bulldozer has 2 sets of 128-bit-wide FMACs, and since each 128-bit unit handles four 32b floats, that is 8 FMACs per cycle. Times the number of CPU cores on a chip - say 8 - giving you maybe 64 per cycle. Whereas a modern GPU may do 2,048 32b FMACs every cycle. Even if it runs at 1/2 or 1/4 the clock rate, that's a big difference.
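
Back-of-the-envelope, using the numbers above (the unit counts are from the paragraph above; the clock rates are illustrative assumptions, not measurements of any particular part):

#include <stdio.h>

int main(void) {
    // CPU: 2 x 128-bit FMAC units = 8 single-precision FMACs/cycle/core,
    // times 8 cores, at an assumed ~3 GHz
    double cpu_fmacs = 8.0 * 8.0 * 3.0e9;
    // GPU: 2,048 32b FMACs per cycle, at an assumed ~1.5 GHz (half the CPU clock)
    double gpu_fmacs = 2048.0 * 1.5e9;
    printf("CPU peak: %4.0f GFMAC/s\n", cpu_fmacs / 1e9);   // ~192
    printf("GPU peak: %4.0f GFMAC/s\n", gpu_fmacs / 1e9);   // ~3072, i.e. ~16x
    return 0;
}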

How can the GPUs have so much more hardware? Well, first, they are usually bigger chips than the CPU. But, also, they tend not to spend (some say "waste") hardware on things like big caches and out-of-order execution that CPUs spend it on. CPUs try to make one or a few computations fast, whereas GPUs do many computations in parallel, but individually slower than the CPU. Still, the total number of computations that the GPU can do per second is much higher than a CPU can do.

GPUs also have other hardware optimizations. For example, they run many more threads than a CPU. Whereas an Intel CPU has 2 hyperthreads per core, giving you 16 threads on an 8-core chip, a GPU may have hundreds. And so on.

Most interesting to me as a computer architect, many GPUs have special hardware support for SIMD control flow. They make manipulating those masks much more efficient than on a CPU running SSE.

And so on.


Anyway, I hope that I have made my point:

  • while you do have to write SIMD code to run on a GPGPU system (like OpenCL),

  • you should not confuse this sort of SIMD with the SIMD code you have to write to take advantage of Intel SSE.

It's much cleaner.

More and more compilers are allowing the same code to run on both CPU and GPU. I.e. they are increasingly supporting the clean "real SIMD" coding style, rather than the fake "pseudo-SIMD" coding style that has been necessary to take advantage of MMX and SSE and AVX up till now. This is good - such code is equally "nice" to program on both CPU and GPU. But the GPU often runs it much faster. There's a paper by Intel called "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf. It says GPUs are "only" 2.5X faster on average. But that's after a lot of aggressive optimization. The GPU code is often easier to write. And I don't know about you, but I think "only" 2.5X faster is nothing to sneeze at. Especially since the GPGPU code is often easier to read.

Now, there's no free lunch. If your code is naturally data parallel, great. But some code is not. That can be a pain.

And, like all machines, GPUs have their quirks.

But if your code is naturally data parallel, you may get great speedups, with code that is much more readable.

I'm a CPU designer. I expect to borrow lots of ideas from GPUs to make CPUs run faster, and vice versa.
