FLOPS: what is a FLOP, really?
I came from this thread: FLOPS Intel core and testing it with C (innerproduct)
As I began writing simple test scripts, a few questions came into my mind.
Why floating point? What is so significant about floating point that we have to consider it? Why not a simple int?
If I want to measure FLOPS, let's say I am doing the inner product of two vectors. Must the two vectors be float[]? How will the measurement be different if I use int[]?
I am not familiar with Intel architectures. Let's say I have the following operations:
float a = 3.14159; float b = 3.14158; for(int i = 0; i < 100; ++i) { a + b; }
How many "floating point operations" is this?
I am a bit confused because I studied a simplified 32-bit MIPS architecture. For every instruction, there are 32 bits, like 5 bits for operand 1 and 5 bits for operand 2, etc. So for Intel architectures (specifically the same architecture from the previous thread), I was told that the register can hold 128 bits. For SINGLE PRECISION floating point, at 32 bits per floating point number, does that mean for each instruction fed to the processor, it can take 4 floating point numbers? Don't we also have to account for bits involved in operands and other parts of the instruction? How can we just feed 4 floating point numbers to a CPU without any specific meaning to this?
I don't know whether my approach of thinking about everything in bits and pieces makes sense. If not, what "height" of perspective should I be looking at?
9 Answers
1.) Floating point operations simply represent a wider range of math than fixed-width integers. Additionally, heavily numerical or scientific applications (which are typically the ones that actually test a CPU's pure computational power) probably rely on floating point ops more than anything.
2.) They would both have to be float. The CPU won't add an integer and a float; one or the other would be implicitly converted (most likely the integer would be converted to a float), so it would still just be floating point operations.
3.) That would be 100 floating point operations, as well as 100 integer operations, as well as some (100?) control-flow/branch/comparison operations. There'd generally also be loads and stores but you don't seem to be storing the value :)
4.) I'm not sure how to begin with this one; you seem to have a general perspective on the material, but you have confused some of the details. Yes, an individual instruction may be partitioned into sections similar to:
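(A rough sketch along the lines of the 32-bit MIPS encodings from the question; the field names and widths below are just for illustration.)

    | opcode (6 bits) | operand 1 (5 bits) | operand 2 (5 bits) | destination (5 bits) | shift amount (5 bits) | function (6 bits) |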
However, operand 1 and operand 2 don't have to contain the actual values to be added; they could just name the registers to be added. For example, take this SSE instruction:
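    mulps xmm3, xmm1    ; packed single-precision multiply, matching the description below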
It's telling the execution unit to multiply the contents of register xmm3 by the contents of xmm1 and store the result in xmm3. Since the registers hold 128-bit values, I'm doing the operation on 128-bit values; this is independent of the size of the instruction. Unfortunately, x86 does not have a uniform instruction breakdown like MIPS, due to it being a CISC architecture. An x86 instruction can be anywhere between 1 and 15(!) bytes long.
As for your question, I think this is all very fun stuff to know, and it helps you build intuition about the speed of math-intensive programs, as well as giving you a sense of upper limits to be achieved when optimizing. I'd never try and directly correlate this to the actual run time of a program though, as too many other factors contribute to the actual end performance.
Floating point and integer operations use different pipelines on the chip, so they run at different speeds (on simple or old enough architectures there may be no native floating point support at all, making floating point operations very slow). So if you are trying to estimate real-world performance for problems that use floating point math, you need to know how fast these operations are.
Yes, you must use floating point data. See #1.
A FLOP is typically defined as an average over a particular mixture of operations that is intended to be representative of the real world problem you want to model. For your loop, you would just count each addition as 1 operation making a total of 100 operations. BUT: this is not representative of most real world jobs and you may have to take steps to prevent the compiler from optimizing all the work out.
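As a rough illustration of that last point (a sketch of my own, not from the answer; the constants and the use of clock() are arbitrary choices): keep the result live and one input volatile so the compiler cannot discard the arithmetic, then divide the operation count by the elapsed time.

    #include <stdio.h>
    #include <time.h>

    /* Minimal sketch: time N floating point additions and keep the result
       live (printed at the end) so the compiler can't optimize the work away. */
    int main(void)
    {
        const long N = 100000000L;     /* number of additions to time */
        volatile float a = 3.14159f;   /* volatile forces a real load every iteration */
        float b = 3.14158f, sum = 0.0f;

        clock_t start = clock();
        for (long i = 0; i < N; ++i)
            sum += a + b;              /* two float ops per iteration: a+b, then the accumulate */
        clock_t end = clock();

        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        printf("sum = %f, roughly %.3g FLOPS\n", sum, 2.0 * N / seconds);
        return 0;
    }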
Vectorized or SIMD (Single Instruction, Multiple Data) hardware can do exactly that: pack several floats into one register and operate on them with a single instruction. Examples of SIMD systems in use right now include AltiVec (on PowerPC series chips) and MMX/SSE/... on Intel x86 and compatibles. Such improvements in chips should get credit for doing more work, so your trivial loop above would still be counted as 100 operations even if there are only 25 fetch-and-work cycles. Compilers either need to be very smart, or receive hints from the programmer, to make use of SIMD units (but most front-line compilers are very smart these days).
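To make the "several floats per 128-bit register" idea concrete, here is a small sketch of my own using the SSE intrinsics from <xmmintrin.h> (the array values are arbitrary): a single _mm_add_ps performs four single-precision additions at once.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load four floats into one 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four single-precision adds */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }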
Floating Point Operations per Second.
http://www.webopedia.com/TERM/F/FLOPS.html
Your example is 100 floating point operations (adding the two floating point numbers together is one floating point operation). Allocating floating point numbers may or may not count.
The term is apparently not an exact measurement, as it is clear that a double-precision floating-point operation is going to take longer than a single-precision one, and multiplication and division are going to take longer than addition and subtraction. As the Wikipedia article attests, there are ultimately better ways to measure performance.
1) Because many real-world applications spend their run time crunching a lot of floating point numbers; for example, all vector-based apps (games, CAD, etc.) rely almost entirely on floating point operations.
2) FLOPS is for floating point operations.
3) 100. The flow control uses integer operations.
4) That architecture is best suited to the ALU. Floating point representations can use 96-128 bits.
Floating point operations are the limiting factor in certain computing problems. If your problem isn't one of them, you can safely ignore flops ratings.
The Intel architecture started out with simple 80-bit floating point instructions (the x87 FPU), which can load from or store to 64-bit memory locations with rounding. Later they added the SSE instructions, which use 128-bit registers and can do multiple floating point operations with a single instruction.
Yuck, simplified MIPS. Typically, that's fine for intro courses. I'm going to assume a Hennessy/Patterson book?
Read up on the MMX instructions for the Pentium architecture (586) for the Intel approach. Or, more generally, study the SIMD architectures, which are also known as vector processor architectures. They were first popularized by the Cray supercomputers (although I think there were a few forerunners). For a modern SIMD approach, see the CUDA approach produced by NVIDIA or the different DSP processors on the market.
The 128-bit thing is about the internal representation of floats in the processor. It uses really big floats internally to try to avoid rounding errors, and then truncates them when you put the numbers back into memory.
There are lots of things floating point math does far better than integer math. Most university computer science curricula have a course on it called "numerical analysis".
The vector elements must be float, double, or long double. The inner product calculation will be slower than if the elements were ints.
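Since the question is about timing an inner product in C, a plain float version looks something like the sketch below (my own code, not the answer's); each iteration is one multiply plus one add, i.e. two floating point operations.

    #include <stddef.h>

    /* Dot product of two float vectors: n multiplies + n adds = 2n floating point ops. */
    float inner_product(const float *x, const float *y, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }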
That would be 100 floating point adds. (That is, unless the compiler realizes nothing is ever done with the result and optimizes the whole thing away.)
Computers use a variety of internal formats to represent floating point numbers. In the example you mention, the CPU would convert the 32-bit float into its internal 128-bit format before doing operations on the number.
In addition to uses other answers have mentioned, people called "quants" use floating point math for finance these days. A guy named David E. Shaw started applying floating point math to modeling Wall Street in 1988, and as of Sept. 30, 2009, is worth $2.5 billion and ranks #123 on the Forbes list of the 400 richest Americans.
So it's worth learning a bit about floating point math!
1) Floating point is important because sometimes we want to represent really big or really small numbers, and integers aren't really so good at that. Read up on the IEEE-754 standard, but the mantissa is like the integer portion, and we trade some bits to work as an exponent, which allows a much wider range of numbers to be represented.
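To see that split for yourself, here is a small C sketch of my own (not part of the answer) that pulls apart the 1 sign bit, 8 exponent bits, and 23 fraction bits of a single-precision float:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = 3.14159f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);            /* reinterpret the float's bits */

        unsigned sign     = bits >> 31;            /* 1 bit */
        unsigned exponent = (bits >> 23) & 0xFFu;  /* 8 bits, biased by 127 */
        unsigned fraction = bits & 0x7FFFFFu;      /* 23 bits */

        printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
               sign, exponent, (int)exponent - 127, fraction);
        return 0;
    }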
2) If the two vectors are ints, you won't measure FLOPS. If one vector is int and another is float, you'll be doing lots of int->float conversions, and we should probably consider such a conversion to be a FLOP.
3/4) Floating point operations on Intel architectures are really quite exotic. It's actually a stack-based, single-operand instruction set (usually). For instance, in your example, you would use one instruction whose opcode loads a memory operand onto the top of the FPU stack, then another instruction whose opcode adds a memory operand to the top of the FPU stack, and finally another instruction whose opcode pops the top of the FPU stack out to the memory operand.
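As a sketch of what that looks like in practice (my own example; it assumes a compiler targeting the legacy x87 FPU, e.g. gcc -m32 -mfpmath=387), a simple addition compiles to roughly the load/add/pop sequence just described:

    /* For c = a + b, the x87 code is typically something like:
           fld  a    ; push a onto the top of the FPU stack
           fadd b    ; add the memory operand b to the stack top
           fstp c    ; pop the stack top back out to memory c   */
    void add_floats(const float *a, const float *b, float *c)
    {
        *c = *a + *b;
    }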
This website lists a lot of the operations.
http://www.website.masmforum.com/tutorials/fptute/appen1.htm
I'm sure Intel publishes the actual opcodes somewhere, if you're really that interested.