How do I achieve the theoretical maximum of 4 FLOPs per cycle?
How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU?
As far as I understand, an SSE add takes three cycles and a mul takes five cycles to complete on most modern Intel CPUs (see, for example, Agner Fog's 'Instruction Tables'). Due to pipelining, one can get a throughput of one add per cycle if the algorithm has at least three independent summations. Since that is true for both the packed addpd and the scalar addsd version, and SSE registers can contain two doubles, the throughput can be as much as two flops per cycle.

Furthermore, it seems (although I've not seen any proper documentation on this) that adds and muls can be executed in parallel, giving a theoretical maximum throughput of four flops per cycle.
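To put numbers on that: at a fixed clock, peak Gflops is simply flops-per-cycle times GHz, and a measured Gflops figure divides by the clock to give flops per cycle. A minimal helper illustrating the arithmetic (assuming the 2.66 GHz Core i5-750 used below, with turbo and throttling ignored):

#include <stdio.h>

/* Back-of-the-envelope conversion between Gflops and flops per cycle.
 * Assumes a fixed 2.66 GHz clock (the nominal clock of the Core i5-750). */
int main(void) {
    const double ghz  = 2.66;   /* clock frequency in GHz = Gcycles/s          */
    const double peak = 4.0;    /* flops per cycle: one addpd plus one mulpd,
                                   each operating on two doubles               */
    printf("theoretical peak: %.2f Gflops\n", peak * ghz);               /* ~10.64 */
    printf("3.707 Gflops corresponds to %.2f flops/cycle\n", 3.707/ghz); /* ~1.39  */
    return 0;
}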
However, I've not been able to replicate that performance with a simple C/C++ programme. My best attempt resulted in about 2.7 flops/cycle. If anyone can contribute a simple C/C++ or assembler programme which demonstrates peak performance, that'd be greatly appreciated.
My attempt:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

double stoptime(void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double) t.tv_sec + t.tv_usec/1000000.0;
}

double addmul(double add, double mul, int ops) {
    // Need to initialise differently otherwise compiler might optimise away
    double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
    double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
    int loops=ops/10;  // We have 10 floating point operations inside the loop
    double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
                    + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);

    for (int i=0; i<loops; i++) {
        mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
        sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
    }
    return sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
}

int main(int argc, char** argv) {
    if (argc != 2) {
        printf("usage: %s <num>\n", argv[0]);
        printf("number of operations: <num> millions\n");
        exit(EXIT_FAILURE);
    }
    int n = atoi(argv[1]) * 1000000;
    if (n<=0)
        n=1000;

    double x = M_PI;
    double y = 1.0 + 1e-8;
    double t = stoptime();
    x = addmul(x, y, n);
    t = stoptime() - t;
    printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n", t, (double)n/t/1e9, x);
    return EXIT_SUCCESS;
}
Compiled with:
g++ -O2 -march=native addmul.cpp ; ./a.out 1000
produces the following output on an Intel Core i5-750, 2.66 GHz:
addmul: 0.270 s, 3.707 Gflops, res=1.326463
That is, just about 1.4 flops per cycle. Looking at the assembler code with g++ -S -O2 -march=native -masm=intel addmul.cpp, the main loop seems kind of optimal to me:
.L4:
inc eax
mulsd xmm8, xmm3
mulsd xmm7, xmm3
mulsd xmm6, xmm3
mulsd xmm5, xmm3
mulsd xmm1, xmm3
addsd xmm13, xmm2
addsd xmm12, xmm2
addsd xmm11, xmm2
addsd xmm10, xmm2
addsd xmm9, xmm2
cmp eax, ebx
jne .L4
Replacing the scalar versions with the packed versions (addpd and mulpd) would double the flop count without changing the execution time, and so I'd get just short of 2.8 flops per cycle. Is there a simple example which achieves four flops per cycle?
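For illustration, a sketch of what such a packed variant might look like with SSE intrinsics (untested and simplified relative to the scalar code above; addmul_packed is just a hypothetical name, and the initial values and final horizontal sum are only there to keep the compiler from optimising the loop away):

#include <emmintrin.h>

// Packed counterpart of the scalar loop: every register holds two doubles,
// so the same number of instructions performs twice the flops.
double addmul_packed(double add, double mul, int ops) {
    __m128d vadd = _mm_set1_pd(add), vmul = _mm_set1_pd(mul);
    __m128d sum1 = _mm_set_pd( 0.1, -0.1), sum2 = _mm_set_pd( 0.2, -0.2);
    __m128d sum3 = _mm_set_pd( 0.3, -0.3), sum4 = _mm_set_pd( 0.4, -0.4);
    __m128d mul1 = _mm_set1_pd(1.0), mul2 = _mm_set1_pd(1.1);
    __m128d mul3 = _mm_set1_pd(1.2), mul4 = _mm_set1_pd(1.3);
    int loops = ops / 16;   // 8 packed instructions * 2 doubles = 16 flops per iteration
    for (int i = 0; i < loops; i++) {
        mul1 = _mm_mul_pd(mul1, vmul); mul2 = _mm_mul_pd(mul2, vmul);
        mul3 = _mm_mul_pd(mul3, vmul); mul4 = _mm_mul_pd(mul4, vmul);
        sum1 = _mm_add_pd(sum1, vadd); sum2 = _mm_add_pd(sum2, vadd);
        sum3 = _mm_add_pd(sum3, vadd); sum4 = _mm_add_pd(sum4, vadd);
    }
    // Combine all accumulators so the compiler keeps the work.
    __m128d r = _mm_add_pd(_mm_add_pd(sum1, sum2), _mm_add_pd(sum3, sum4));
    r = _mm_add_pd(r, _mm_add_pd(_mm_add_pd(mul1, mul2), _mm_add_pd(mul3, mul4)));
    double out[2];
    _mm_storeu_pd(out, r);
    return out[0] + out[1];
}

Note that with only four independent multiply chains the five-cycle mulpd latency is not fully hidden; a fifth chain (as in the scalar code above) would be needed for that.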
Nice little programme by Mysticial; here are my results (run just for a few seconds though):
gcc -O2 -march=nocona: 5.6 Gflops out of 10.66 Gflops (2.1 flops/cycle)
cl /O2, openmp removed: 10.1 Gflops out of 10.66 Gflops (3.8 flops/cycle)
It all seems a bit complex, but my conclusions so far:
- gcc -O2 changes the order of independent floating point operations with the aim of alternating addpd and mulpd if possible. The same applies to gcc-4.6.2 -O2 -march=core2.
- gcc -O2 -march=nocona seems to keep the order of floating point operations as defined in the C++ source.
- cl /O2, the 64-bit compiler from the SDK for Windows 7, does loop unrolling automatically and seems to try to arrange operations so that groups of three addpd alternate with three mulpd (well, at least on my system and for my simple programme).
- My Core i5-750 (Nehalem architecture) doesn't like alternating adds and muls and seems unable to run both operations in parallel. However, if they are grouped in threes, it suddenly works like magic.
- Other architectures (possibly Sandy Bridge and others) appear to be able to execute add/mul in parallel without problems if they alternate in the assembly code.
- Although it is difficult to admit, on my system cl /O2 does a much better job at low-level optimisation and achieves close to peak performance for the little C++ example above. I measured 1.85-2.01 flops/cycle (I used clock() in Windows, which is not that precise; I guess I need to use a better timer - thanks Mackie Messer).
- The best I managed with gcc was to manually unroll the loop and arrange additions and multiplications in groups of three. With g++ -O2 -march=nocona addmul_unroll.cpp I get at best 0.207 s, 4.825 Gflops, which corresponds to 1.8 flops/cycle, which I'm quite happy with now.
In the C++ code I've replaced the for loop with:
for (int i=0; i<loops/3; i++) {
    mul1*=mul; mul2*=mul; mul3*=mul;
    sum1+=add; sum2+=add; sum3+=add;
    mul4*=mul; mul5*=mul; mul1*=mul;
    sum4+=add; sum5+=add; sum1+=add;
    mul2*=mul; mul3*=mul; mul4*=mul;
    sum2+=add; sum3+=add; sum4+=add;
    mul5*=mul; mul1*=mul; mul2*=mul;
    sum5+=add; sum1+=add; sum2+=add;
    mul3*=mul; mul4*=mul; mul5*=mul;
    sum3+=add; sum4+=add; sum5+=add;
}
And the assembly now looks like:
.L4:
mulsd xmm8, xmm3
mulsd xmm7, xmm3
mulsd xmm6, xmm3
addsd xmm13, xmm2
addsd xmm12, xmm2
addsd xmm11, xmm2
mulsd xmm5, xmm3
mulsd xmm1, xmm3
mulsd xmm8, xmm3
addsd xmm10, xmm2
addsd xmm9, xmm2
addsd xmm13, xmm2
...
Comments (4)
I've done this exact task before. But it was mainly to measure power consumption and CPU temperatures. The following code (which is fairly long) achieves close to optimal on my Core i7 2600K.
The key thing to note here is the massive amount of manual loop-unrolling as well as interleaving of multiplies and adds...
The full project can be found on my GitHub: https://github.com/Mysticial/Flops
Warning:
If you decide to compile and run this, pay attention to your CPU temperatures!!!
Make sure you don't overheat it. And make sure CPU-throttling doesn't affect your results!
Furthermore, I take no responsibility for whatever damage may result from running this code.
Notes:
ICC 11 (Intel Compiler 11) surprisingly has trouble compiling it well.
Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
The machine is a Core i7 2600K @ 4.4 GHz. Theoretical SSE peak is 4 flops * 4.4 GHz = 17.6 GFlops. This code achieves 17.3 GFlops - not bad.
Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
Theoretical SSE peak is 4 flops * 4 cores * 4.4 GHz = 70.4 GFlops. Actual is 65.5 GFlops.
Let's take this one step further. AVX...
Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
Theoretical AVX peak is 8 flops * 4.4 GHz = 35.2 GFlops. Actual is 33.4 GFlops.
Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
Theoretical AVX peak is 8 flops * 4 cores * 4.4 GHz = 140.8 GFlops. Actual is 138.2 GFlops.
Now for some explanations:
The performance-critical part is obviously the 48 instructions inside the inner loop. You'll notice that it's broken into 4 blocks of 12 instructions each. Each of these 12-instruction blocks is completely independent of the others and takes on average 6 cycles to execute.
So there are 12 instructions and 6 cycles between issue and use. The latency of multiplication is 5 cycles, so that's just enough to avoid latency stalls.
The normalization step is needed to keep the data from over/underflowing; without it, the do-nothing code would slowly increase or decrease the magnitude of the data.
So it's actually possible to do better than this if you just use all zeros and get rid of the normalization step. However, since I wrote the benchmark to measure power consumption and temperature, I had to make sure the flops were on "real" data rather than zeros, as the execution units may very well have special-case handling for zeros that uses less power and produces less heat.
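As a rough illustration of that structure (a sketch in the spirit of the description above, not the code from the Flops project; interleaved_kernel and all constants are made up): independent multiply and add chains are interleaved so that no result is needed again before its latency has elapsed, and each block is immediately "undone" so the values stay in a sane range.

#include <emmintrin.h>

// Sketch of an interleaved mulpd/addpd kernel with a normalization step.
// 10 packed multiplies + 10 packed adds/subs per iteration = 40 flops,
// i.e. ~10 cycles per iteration at the 4 flops/cycle target.
double interleaved_kernel(long iterations) {
    const __m128d grow   = _mm_set1_pd(1.0001);  // multiply factor
    const __m128d shrink = _mm_set1_pd(0.9999);  // approximately undoes 'grow'
    const __m128d step   = _mm_set1_pd(0.0001);  // additive step
    __m128d x0 = _mm_set1_pd(1.1), x1 = _mm_set1_pd(1.2), x2 = _mm_set1_pd(1.3),
            x3 = _mm_set1_pd(1.4), x4 = _mm_set1_pd(1.5);
    __m128d y0 = _mm_set1_pd(0.1), y1 = _mm_set1_pd(0.2), y2 = _mm_set1_pd(0.3),
            y3 = _mm_set1_pd(0.4), y4 = _mm_set1_pd(0.5);
    for (long i = 0; i < iterations; i++) {
        // Five independent multiply chains interleaved with five independent
        // add chains: each multiply chain is reused only five multiplies
        // later, which just covers the five-cycle multiply latency.
        x0 = _mm_mul_pd(x0, grow);    y0 = _mm_add_pd(y0, step);
        x1 = _mm_mul_pd(x1, grow);    y1 = _mm_add_pd(y1, step);
        x2 = _mm_mul_pd(x2, grow);    y2 = _mm_add_pd(y2, step);
        x3 = _mm_mul_pd(x3, grow);    y3 = _mm_add_pd(y3, step);
        x4 = _mm_mul_pd(x4, grow);    y4 = _mm_add_pd(y4, step);
        // "Normalization" block: shrink/subtract so the data never drifts
        // towards overflow or underflow (these count as flops too).
        x0 = _mm_mul_pd(x0, shrink);  y0 = _mm_sub_pd(y0, step);
        x1 = _mm_mul_pd(x1, shrink);  y1 = _mm_sub_pd(y1, step);
        x2 = _mm_mul_pd(x2, shrink);  y2 = _mm_sub_pd(y2, step);
        x3 = _mm_mul_pd(x3, shrink);  y3 = _mm_sub_pd(y3, step);
        x4 = _mm_mul_pd(x4, shrink);  y4 = _mm_sub_pd(y4, step);
    }
    // Combine and return the results so the compiler cannot discard the work.
    __m128d r = _mm_add_pd(_mm_add_pd(_mm_add_pd(x0, x1), _mm_add_pd(x2, x3)),
                           _mm_add_pd(_mm_add_pd(y0, y1), _mm_add_pd(y2, y3)));
    r = _mm_add_pd(r, _mm_add_pd(x4, y4));
    double out[2];
    _mm_storeu_pd(out, r);
    return out[0] + out[1];
}

An AVX version of the same idea would use __m256d with the _mm256_mul_pd/_mm256_add_pd intrinsics from immintrin.h, doubling the flops per instruction again.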
More Results:
Threads: 1
Theoretical SSE Peak: 4 flops * 3.5 GHz = 14.0 GFlops. Actual is 13.3 GFlops.
Threads: 8
Theoretical SSE Peak: 4 flops * 4 cores * 3.5 GHz = 56.0 GFlops. Actual is 51.3 GFlops.
My processor temps hit 76C on the multi-threaded run! If you run these, be sure the results aren't affected by CPU throttling.
Threads: 1
Theoretical SSE Peak: 4 flops * 3.2 GHz = 12.8 GFlops. Actual is 12.3 GFlops.
Threads: 8
Theoretical SSE Peak: 4 flops * 8 cores * 3.2 GHz = 102.4 GFlops. Actual is 97.9 GFlops.
There's a point in the Intel architecture that people often forget: the dispatch ports are shared between Int and FP/SIMD. This means that you will only get so many bursts of FP/SIMD work before the loop logic creates bubbles in your floating point stream. Mysticial got more flops out of his code because he used longer strides in his unrolled loop.
If you look at the Nehalem/Sandy Bridge architecture here (http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6), it's quite clear what happens.
In contrast, it should be easier to reach peak performance on AMD (Bulldozer) as the INT and FP/SIMD pipes have separate issue ports with their own scheduler.
This is only theoretical as I have neither of these processors to test.
Branches can definitely keep you from sustaining peak theoretical performance. Do you see a difference if you manually do some loop-unrolling? For example, if you put 5 or 10 times as many ops per loop iteration:
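A hypothetical unroll-by-five of the loop in the question's addmul() (a sketch only; the body from the question is simply repeated, so the counter increment, compare and branch are executed once per fifty floating point operations instead of once per ten):

for (int i = 0; i < loops/5; i++) {
    // Same ten independent operations as before, repeated five times,
    // so the loop overhead is amortised over 50 flops.
    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;

    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;

    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;

    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;

    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
}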
Using Intel's icc version 11.1 on a 2.4 GHz Intel Core 2 Duo, I get very close to the ideal 9.6 Gflops.
EDIT:
Oops, looking at the assembly code it seems that icc not only vectorized the multiplication but also pulled the additions out of the loop. Forcing stricter fp semantics, the code is no longer vectorized.
EDIT2:
As requested:
The inner loop of clang's code looks like this:
EDIT3:
Finally, two suggestions. First, if you like this type of benchmarking, consider using the rdtsc instruction instead of gettimeofday(2). It is much more accurate and delivers the time in cycles, which is usually what you are interested in anyway. For gcc and friends you can define a small wrapper for it; a sketch is given below. Second, you should run your benchmark program several times and use the best performance only. In modern operating systems many things happen in parallel, the CPU may be in a low-frequency power-saving mode, etc. Running the program repeatedly gives you a result that is closer to the ideal case.
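One possible gcc-style definition (a sketch; serializing instructions such as cpuid or lfence are left out for brevity, so instructions may be reordered around the measurement):

#include <stdint.h>

// Read the time-stamp counter with gcc inline assembly:
// rdtsc returns the low 32 bits in EAX and the high 32 bits in EDX.
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

Timing a kernel then amounts to reading the counter before and after it and taking the difference in cycles.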