What is FLOP/s, and is it a good measure of performance?

Published 2024-07-10 02:27:28 · 461 words · 4 views · 0 comments


I've been asked to measure the performance of a Fortran program that solves differential equations on a multi-CPU system. My employer insists that I measure FLOP/s (floating-point operations per second) and compare the results with benchmarks (LINPACK), but I am not convinced that's the way to go, simply because no one can explain to me what a FLOP is.

I did some research on what exactly a FLOP is and I got some pretty contradicting answers. One of the most popular answers I got was '1 FLOP = An addition and a multiplication operation'. Is that true? If so, again, physically, what exactly does that mean?

Whatever method I end up using, it has to be scalable. Some versions of the code solve systems with multi-million unknowns and take days to execute.

What would be some other, effective ways of measuring performance in my case (a summary of my case being 'Fortran code that does a whole lot of arithmetic calculations over and over again for days on several hundred CPUs')?


Comments (9)

衣神在巴黎 2024-07-17 02:27:28


"compare the results with benchmarks" and do what?

FLOPS means you need

1) FLOPs per some unit of work.

2) time for that unit of work.

Let's say you have some input file that does 1,000 iterations through some loop. The loop is a handy unit of work. It gets executed 1,000 times. It takes an hour.

The loop has some adds and multiplies and a few divides and a square root. You can count adds, multiplies and divides. You can count this in the source, looking for +, * and /. You can find the assembler-language output from the compiler, and count them there, too. You may get different numbers. Which one is right? Ask your boss.

You can count the square roots, but you don't know what it really does in terms of multiplies and adds. So, you'll have to do something like benchmark multiply vs. square root to get a sense of how long a square root takes.

Now you know the FLOPS in your loop. And you know the time to run it 1,000 times. You know FLOPS per second.
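The bookkeeping described above can be sketched as a small function. All the operation counts, the square-root cost, and the timings below are invented for illustration; they are not measurements from any real program.

```python
# Sketch: turning per-iteration operation counts into a FLOP/s figure.

def flops_per_second(adds, muls, divs, sqrt_cost, sqrts, iterations, seconds):
    """FLOP/s given per-iteration op counts and total wall-clock time.

    sqrt_cost is how many flops you decide one square root is "worth";
    as the answer says, you have to benchmark (or pick a convention),
    because there is no single right number.
    """
    flop_per_iteration = adds + muls + divs + sqrt_cost * sqrts
    total_flop = flop_per_iteration * iterations
    return total_flop / seconds

# 1,000 iterations of a loop with 200 adds, 150 multiplies, 10 divides
# and 2 square roots (counted as 8 flops each), taking one hour in total:
rate = flops_per_second(adds=200, muls=150, divs=10,
                        sqrt_cost=8, sqrts=2,
                        iterations=1_000, seconds=3600.0)
print(f"{rate:.1f} FLOP/s")
```

Whether you count from the source or from the compiler's assembler output only changes the inputs to this arithmetic, not the arithmetic itself.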

Then you look at LINPACK and find you're slower. Now what? Your program isn't LINPACK, and it's slower than LINPACK. Odds are really good that your code will be slower. Unless your code was written and optimized over the same number of years as LINPACK, you'll be slower.

Here's the other part. Your processor has some defined FLOPS rating against various benchmarks. Your algorithm is not one of those benchmarks, so you fall short of the benchmarks. Is this bad? Or is this the obvious consequence of not being a benchmark?

What's the actionable outcome going to be?

Measurement against some benchmark code base is only going to tell you that your algorithm isn't the benchmark algorithm. It's a foregone conclusion that you'll be different; usually slower.

Obviously, the result of measuring against LINPACK will be (a) you're different and therefore (b) you need to optimize.

Measurement is only really valuable when done against yourself. Not some hypothetical instruction mix, but your own instruction mix. Measure your own performance. Make a change. See whether your performance -- compared with yourself -- gets better or worse.

FLOPS don't matter. What matters is time per unit of work. You'll never match the design parameters of your hardware because you're not running the benchmark that your hardware designers expected.

LINPACK doesn't matter. What matters is your code base and the changes you're making to change performance.

百善笑为先 2024-07-17 02:27:28


A FLOPS is, as you said, a floating-point operation per second. As an example, if you take exactly one second for an operation (such as adding, subtracting, multiplying or dividing two values and returning the result), your performance is simply 1 FLOPS. A recent CPU will easily achieve several GigaFLOPS, i.e. several billion floating-point operations per second.

海之角 2024-07-17 02:27:28


It's a pretty decent measure of performance, as long as you understand exactly what it measures.

FLOPS is, as the name implies, FLoating point OPerations per Second; exactly what constitutes a FLOP might vary by CPU. (Some CPUs can perform addition and multiplication as one operation, for example, while others can't.) That means that as a performance measure, it is fairly close to the hardware, which means that 1) you have to know your hardware to compute the ideal FLOPS on the given architecture, and 2) you have to know your algorithm and implementation to figure out how many floating-point ops it actually consists of.

In any case, it's a useful tool for examining how well you utilize the CPU. If you know the CPU's theoretical peak performance in FLOPS, you can work out how efficiently you use the CPU's floating-point units, which are often among the hardest to utilize efficiently. A program which runs at 30% of the FLOPS the CPU is capable of has room for optimization. One which runs at 70% is probably not going to get much more efficient unless you change the basic algorithm. For math-heavy algorithms like yours, that is pretty much the standard way to measure performance. You could simply measure how long the program takes to run, but that varies wildly depending on the CPU. But if your program has 50% CPU utilization (relative to the peak FLOPS count), that is a somewhat more constant value (it'll still vary between radically different CPU architectures, but it's a lot more consistent than execution time).
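The utilization figure above is just achieved throughput over theoretical peak. A minimal sketch, with an entirely hypothetical machine (real peak numbers come from the CPU vendor's documentation, roughly cores × clock × flops per cycle):

```python
# Sketch: achieved FLOP/s as a fraction of theoretical peak.

def utilization(achieved_gflops, peak_gflops):
    """Fraction of theoretical peak floating-point throughput achieved."""
    return achieved_gflops / peak_gflops

# Hypothetical 16-core machine at 2.5 GHz, 16 flops/cycle/core:
peak = 16 * 2.5 * 16        # 640 GFLOPS theoretical peak
achieved = 192.0            # measured throughput (invented number)
print(f"{utilization(achieved, peak):.0%} of peak")   # 30% of peak
```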

But knowing that "My CPU is capable of X GFLOPS, and I'm only actually achieving a throughput of, say, 20% of that" is very valuable information in high-performance software. It means that something other than the floating point ops is holding you back, and preventing the FP units from working efficiently. And since the FP units constitute the bulk of the work, that means your software has a problem.

It's easy to measure "My program runs in X minutes", and if you feel that is unacceptable then sure, you can go "I wonder if I can chop 30% off that", but you don't know if that is possible unless you work out exactly how much work is being done, and exactly what the CPU is capable of at peak. How much time do you want to spend optimizing this, if you don't even know whether the CPU is fundamentally capable of running any more instructions per second?

It's very easy to prevent the CPU's FP unit from being utilized efficiently, by having too many dependencies between FP ops, or by having too many branches or similar preventing efficient scheduling. And if that is what is holding your implementation back, you need to know that. You need to know that "I'm not getting the FP throughput that should be possible, so clearly other parts of my code are preventing FP instructions from being available when the CPU is ready to issue one".

Why do you need other ways to measure performance? What's wrong with just working out the FLOPS count as your boss asked you to? ;)

笔芯 2024-07-17 02:27:28


I'd just like to add a couple of finer points:

  • division is special. Since most processors can do an addition, comparison, or multiplication in a single cycle, those are all counted as one flop. But division always takes longer. How much longer depends on the processor, but there's a sort of de facto standard in the HPC community to count one division as 4 flops.

  • If a processor has a fused multiply-add instruction that does a multiplication and an addition in a single instruction -- generally A += B * C -- that counts as 2 operations.

  • Always be careful in distinguishing between single-precision flops and double-precision flops. A processor that is capable of so many single-precision gigaflops may only be capable of a small fraction of that many double-precision gigaflops. The AMD Athlon and Phenom processors can generally do half as many double-precision flops as single precision. The ATI Firestream processors can generally do 1/5th as many double-precision flops as single precision. If someone is trying to sell you a processor or a software package and they just quote flops without saying which, you should call them on it.

  • The terms megaflop, gigaflop, teraflop, etc. are in common use. These refer to factors of 1000, not 1024. E.g., 1 megaflop = 1,000,000 flop/sec not 1,048,576. Just as with disk drive sizes, there is some confusion on this.
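The counting conventions in these bullets can be written down directly. The per-operation costs below follow the answer's conventions (divide = 4 flops, fused multiply-add = 2); they are community conventions, not a standard:

```python
# Sketch of the flop-counting conventions described above.

FLOP_COST = {"add": 1, "cmp": 1, "mul": 1, "div": 4, "fma": 2}

def count_flops(ops):
    """Total flops for a dict of {operation: count}."""
    return sum(FLOP_COST[op] * n for op, n in ops.items())

def to_gigaflops(flop_per_sec):
    """SI prefixes: factors of 1000, not 1024 -- 1 GFLOP/s = 1e9 flop/s."""
    return flop_per_sec / 1_000_000_000

total = count_flops({"add": 100, "mul": 100, "div": 10, "fma": 50})
print(total)   # 100*1 + 100*1 + 10*4 + 50*2 = 340 flops
```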

み青杉依旧 2024-07-17 02:27:28


Old question with old, if popular, answers that are not exactly great, IMO.

A “FLOP” is a floating-point math operation. “FLOPS” can mean either of two things:

  • The simple plural of “FLOP” (i.e. “operation X takes 50 FLOPs”)
  • The rate of FLOPs in the first sense (i.e. floating-point math operations per second)

Where it is not clear from context, which of these is meant is often disambiguated by writing the former as “FLOPs” and the latter as “FLOP/s”.

FLOPs are so-called to distinguish them from other kinds of CPU operations, such as integer math operations, logical operations, bitwise operations, memory operations, and branching operations, which have different costs (read “take different lengths of time”) associated with them.

The practice of “FLOP counting” dates back to the very early days of scientific computing, when FLOPs were, relatively speaking, extremely expensive, taking many CPU cycles each. An 80387 math coprocessor, for example, took something like 300 cycles for a single multiplication. This was at a time before pipelining and before the gulf between CPU clock speeds and memory speeds had really opened up: memory operations took just a cycle or two, and branching (“decision making”) was similarly cheap. Back then, if you could eliminate a single FLOP in favor of a dozen memory accesses, you made a gain. If you could eliminate a single FLOP in favor of a dozen branches, you made a gain. So, in the past, it made sense to count FLOPs and not worry much about memory references and branches because FLOPs strongly dominated execution time because they were individually very expensive relative to other kinds of operation.

More recently, the situation has reversed. FLOPs have become very cheap — any modern Intel core can perform about two FLOPs per cycle (although division remains relatively expensive) — and memory accesses and branches are comparatively much more expensive: an L1 cache hit costs maybe 3 or 4 cycles, a fetch from main memory costs 150–200. Given this inversion, it is no longer the case that eliminating a FLOP in favor of a memory access will result in a gain; in fact, that's unlikely. Similarly, it is often cheaper to “just do” a FLOP, even if it's redundant, rather than decide whether to do it or not. This is pretty much the complete opposite of the situation 25 years ago.

Unfortunately, the practice of blind FLOP-counting as an absolute metric of algorithmic merit has persisted well past its sell-by date. Modern scientific computing is much more about memory bandwidth management — trying to keep the execution units that do the FLOPs constantly fed with data — than it is about reducing the number of FLOPs. The reference to LINPACK (which was essentially obsoleted by LAPACK 20 years ago) leads me to suspect that your employer is probably of a very old school that hasn't internalized the fact that establishing performance expectations is not just a matter of FLOP counting any more. A solver that does twice as many FLOPs could still be twenty times faster than another if it has a much more favorable memory access pattern and data layout.

The upshot of all this is that performance assessment of computationally intensive software has become a lot more complex than it used to be. The fact that FLOPs have become cheap is hugely complicated by the massive variability in the costs of memory operations and branches. When it comes to assessing algorithms, simple FLOP counting simply doesn't inform overall performance expectations any more.

Perhaps a better way of thinking about performance expectations and assessment is provided by the so-called roofline model, which is far from perfect, but has the advantage of making you think about the trade-off between floating-point and memory bandwidth issues at the same time, providing a more informative and insightful “2D picture” that enables the comparison of performance measurements and performance expectations.

It's worth a look.
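The basic roofline formula is simple enough to state in a few lines: attainable throughput is capped either by peak compute or by memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). The machine numbers and kernel intensities below are invented for illustration:

```python
# Sketch of the basic roofline model.

def roofline(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Attainable GFLOP/s: min of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)

peak, bw = 640.0, 80.0           # hypothetical machine
# A streaming kernel at 0.5 flop/byte is bandwidth-bound:
print(roofline(peak, bw, 0.5))   # 40.0 GFLOP/s, nowhere near peak
# A dense matrix-matrix multiply at 20 flop/byte is compute-bound:
print(roofline(peak, bw, 20.0))  # 640.0 GFLOP/s
```

The "ridge point" where the two roofs meet (here 640 / 80 = 8 flop/byte) tells you how much arithmetic intensity a kernel needs before peak FLOPS is even reachable.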

飞烟轻若梦 2024-07-17 02:27:28


I would just try to make it go as fast as possible, and that requires finding out where it is spending time, especially if there are function calls that could be avoided.

I do this by the simple method of just interrupting it a few times while it is running, and seeing what it is doing. Here are the kinds of things I find:

  • Much of the time it is in the process of computing the derivative and/or Jacobian. Much of this time can go into math function calls such as exp(), log(), and sqrt(). Often these are repeated with identical arguments and can be memo-ized. (Massive speedup.)

  • Much of the time is spent calculating derivatives too many times because the integration tolerances are tighter than necessary. (Faster)

  • If an implicit integration algorithm (such as DLSODE Gear) is being used because the equations are thought to be stiff, chances are they are not, and something like Runge-Kutta could be used. (DVERK). (Faster still)

  • Possibly a matrix-exponent algorithm could be used if the model is linear (DGPADM). This is a big win both for performance and precision, and is immune to stiffness. (Way faster)

  • Higher up the call-stack, it could be that the same integrations are being performed repeatedly with slightly different parameters, so as to determine a forward or central-difference gradient of the solution with respect to those parameters. If the differential equations are themselves differentiable, it may be possible to get those gradients analytically, or by augmenting the equations with sensitivity equations. This is not only much faster, but much more precise, which can speed things up still higher up the stack.
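The memoization suggestion in the first bullet is easy to sketch; in Fortran you would roll your own cache, but the idea is the same as Python's `functools.lru_cache` (the call counter here is only instrumentation to show the cache working):

```python
# Sketch: caching an expensive math call that is repeated with
# identical arguments.

import math
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def cached_exp(x):
    calls["n"] += 1      # counts how often exp() is actually computed
    return math.exp(x)

for _ in range(1000):    # the same argument requested 1,000 times...
    y = cached_exp(2.5)

print(calls["n"])        # ...but the computation ran only once
```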

You can look at each level of the stack as an opportunity to find things to optimize, and the speedups will compound. Then when you go to multi-cpu, assuming it is parallelizable, that should provide its own multiplicative factor.

So back to FLOPs. You could try to maximize FLOPs/second, but it can also be much more useful to minimize FLOPs/run, by optimizing at all levels of the stack. In any case, just measuring them tells you almost nothing.

凉风有信 2024-07-17 02:27:28


Your employer is right.
The only way to measure effectiveness of your Fortran program (or of any other program, btw) is to test it against standard benchmarks, if they exist.

And, about FLOPS: it stands for "floating-point operations per second" — see the definition on Wikipedia.

七分※倦醒 2024-07-17 02:27:28


I don't think measuring FLOPS will be very useful.

The number of FLOPS achieved will tell you how busy your algorithm is keeping the CPU, but won't tell you how well your algorithm itself is performing.

You may find two different algorithms which cause the processor to perform the same number of FLOPS but one provides you with the desired result in half the time.

I think you'd be better off looking at a much 'higher level' statistic such as the number of differential equations solved per unit of time (that is, after all, the purpose of your algorithm).

On the other hand, measuring the number of FLOPS achieved may help you to improve your algorithm as it will tell you how busy you are keeping the CPU.

一花一树开 2024-07-17 02:27:28


How to Measure T-FLOPS

"(# of parallel GPU processing cores multiplied by peak clock speed in MHz multiplied by two) divided by 1,000,000

The number two in the formula stems from the fact that some GPU instructions can deliver two operations per cycle, and since teraFLOP is a measure of a GPU's maximum graphical potential, we use that metric.

Let's see how we can use that formula to calculate the teraFLOPS in the Xbox One.
The system's integrated graphics has 768 parallel processing cores. The GPU's peak clock speed is 853MHz. When we multiply 768 by 853 and then again by two, and then divide that number by 1,000,000, we get 1.31 teraFLOPS."

https://www.gamespot.com/gallery/console-gpu-power-compared-ranking-systems-by-flop/2900-1334/
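The quoted formula works out as follows (dividing by 1,000,000 lands in teraFLOPS because the clock is already in MHz, i.e. a factor of 10^6 Hz; the "×2" assumes two operations per core per cycle, which is GameSpot's assumption, not a universal one):

```python
# The quoted GameSpot formula, as code.

def gpu_teraflops(cores, clock_mhz, ops_per_cycle=2):
    """Theoretical GPU teraFLOPS: cores x clock (MHz) x ops/cycle / 1e6."""
    return cores * clock_mhz * ops_per_cycle / 1_000_000

# Xbox One numbers from the quote: 768 cores at 853 MHz.
print(round(gpu_teraflops(768, 853), 2))   # 1.31
```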


Price comparison of GPUs from 2016:
"These are theoretical performance figures, which we understand to generally be between somewhat optimistic and ten times too high. So this data suggests real prices of around $0.03-$0.3/GFLOPS. We collected both single and double precision figures, but the cheapest were similar."

https://aiimpacts.org/current-flops-prices/
