FLOPS: what is a FLOP, really?
I came from this thread: FLOPS Intel core and testing it with C (innerproduct)
As I began writing simple test scripts, a few questions came into my mind.
Why floating point? What is so significant about floating point that we have to consider it? Why not a simple int?
If I want to measure FLOPS, let's say I am doing the inner product of two vectors. Must the two vectors be float[]? How will the measurement be different if I use int[]?
I am not familiar with Intel architectures. Let's say I have the following operations:
float a = 3.14159; float b = 3.14158; for(int i = 0; i < 100; ++i) { a + b; }
How many "floating point operations" is this?
I am a bit confused because I studied a simplified 32-bit MIPS architecture. For every instruction, there are 32 bits, like 5 bits for operand 1 and 5 bits for operand 2, etc. So for Intel architectures (specifically the same architecture from the previous thread), I was told that the register can hold 128 bits. For SINGLE PRECISION floating point, at 32 bits per floating point number, does that mean for each instruction fed to the processor, it can take 4 floating point numbers? Don't we also have to account for bits involved in operands and other parts of the instruction? How can we just feed 4 floating point numbers to a CPU without any specific meaning to this?
I don't know whether my approach of thinking about everything in bits and pieces makes sense. If not, what "height" of perspective should I be looking at?
9 Answers
1.) Floating point operations simply represent a wider range of math than fixed-width integers. Additionally, heavily numerical or scientific applications (which are typically the ones that actually test a CPU's pure computational power) probably rely on floating point ops more than anything.
2.) They would both have to be float. The CPU won't add an integer and a float; one or the other would be implicitly converted (most likely the integer would be converted to a float), so it would still just be floating point operations.
3.) That would be 100 floating point operations, as well as 100 integer operations, as well as some (100?) control-flow/branch/comparison operations. There'd generally also be loads and stores but you don't seem to be storing the value :)
4.) I'm not sure how to begin with this one; you seem to have a general perspective on the material, but you have confused some of the details. Yes, an individual instruction may be partitioned into sections similar to:
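(A rough sketch along the lines of the 32-bit MIPS encodings from the question; the field names and widths below are just for illustration.)

    | opcode (6 bits) | operand 1 (5 bits) | operand 2 (5 bits) | destination (5 bits) | shift amount (5 bits) | function (6 bits) |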
However, operand 1 and operand 2 don't have to contain the actual values to be added; they could just name the registers to be added. For example, take this SSE instruction:
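    mulps xmm3, xmm1    ; packed single-precision multiply, matching the description below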
It's telling the execution unit to multiply the contents of register xmm3 by the contents of xmm1 and store the result in xmm3. Since the registers hold 128-bit values, I'm doing the operation on 128-bit values; this is independent of the size of the instruction. Unfortunately, x86 does not have a uniform instruction breakdown like MIPS, due to it being a CISC architecture. An x86 instruction can be anywhere between 1 and 15(!) bytes long.
As for your question, I think this is all very fun stuff to know, and it helps you build intuition about the speed of math-intensive programs, as well as giving you a sense of upper limits to be achieved when optimizing. I'd never try and directly correlate this to the actual run time of a program though, as too many other factors contribute to the actual end performance.
Floating point and integer operations use different pipelines on the chip, so they run at different speeds (on simple or old enough architectures there may be no native floating point support at all, making floating point operations very slow). So if you are trying to estimate real-world performance for problems that use floating point math, you need to know how fast these operations are.
Yes, you must use floating point data. See #1.
A FLOP is typically defined as an average over a particular mixture of operations that is intended to be representative of the real world problem you want to model. For your loop, you would just count each addition as 1 operation making a total of 100 operations. BUT: this is not representative of most real world jobs and you may have to take steps to prevent the compiler from optimizing all the work out.
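As a rough illustration of that last point (a sketch of my own, not from the answer; the constants and the use of clock() are arbitrary choices): keep the result live and one input volatile so the compiler cannot discard the arithmetic, then divide the operation count by the elapsed time.

    #include <stdio.h>
    #include <time.h>

    /* Minimal sketch: time N floating point additions and keep the result
       live (printed at the end) so the compiler can't optimize the work away. */
    int main(void)
    {
        const long N = 100000000L;     /* number of additions to time */
        volatile float a = 3.14159f;   /* volatile forces a real load every iteration */
        float b = 3.14158f, sum = 0.0f;

        clock_t start = clock();
        for (long i = 0; i < N; ++i)
            sum += a + b;              /* two float ops per iteration: a+b, then the accumulate */
        clock_t end = clock();

        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        printf("sum = %f, roughly %.3g FLOPS\n", sum, 2.0 * N / seconds);
        return 0;
    }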
Vectorized or SIMD (Single Instruction, Multiple Data) hardware can do exactly that: pack several floats into one register and operate on them with a single instruction. Examples of SIMD systems in use right now include AltiVec (on PowerPC series chips) and MMX/SSE/... on Intel x86 and compatibles. Such improvements in chips should get credit for doing more work, so your trivial loop above would still be counted as 100 operations even if there are only 25 fetch-and-work cycles. Compilers either need to be very smart, or receive hints from the programmer, to make use of SIMD units (but most front-line compilers are very smart these days).
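To make the "several floats per 128-bit register" idea concrete, here is a small sketch of my own using the SSE intrinsics from <xmmintrin.h> (the array values are arbitrary): a single _mm_add_ps performs four single-precision additions at once.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load four floats into one 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four single-precision adds */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }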
Floating Point Operations per Second.
http://www.webopedia.com/TERM/F/FLOPS.html
Your example is 100 floating point operations (adding the two floating point numbers together is one floating point operation). Allocating floating point numbers may or may not count.
The term is apparently not an exact measurement, as it is clear that a double-precision floating-point operation is going to take longer than a single-precision one, and multiplication and division are going to take longer than addition and subtraction. As the Wikipedia article attests, there are ultimately better ways to measure performance.
1) Because many real-world applications spend their run time crunching a lot of floating point numbers; for example, all vector-based apps (games, CAD, etc.) rely almost entirely on floating point operations.
2) FLOPS is for floating point operations.
3) 100. The flow control uses integer operations.
4) That architecture is best suited to the ALU. Floating point representations can use 96-128 bits.
Floating point operations are the limiting factor in certain computing problems. If your problem isn't one of them, you can safely ignore flops ratings.
The Intel architecture started out with simple 80-bit floating point instructions (the x87 FPU), which can load from or store to 64-bit memory locations with rounding. Later they added the SSE instructions, which use 128-bit registers and can do multiple floating point operations with a single instruction.
Yuck, simplified MIPS. Typically, that's fine for intro courses. I'm going to assume a Hennessy/Patterson book?
Read up on the MMX instructions for the Pentium architecture (586) for the Intel approach. Or, more generally, study the SIMD architectures, which are also known as vector processor architectures. They were first popularized by the Cray supercomputers (although I think there were a few forerunners). For a modern SIMD approach, see the CUDA approach produced by NVIDIA or the different DSP processors on the market.
The 128-bit thing is about the internal representation of floats in the processor. It uses really big floats internally to try to avoid rounding errors, and then truncates them when you put the numbers back into memory.
There are lots of things floating point math does far better than integer math. Most university computer science curricula have a course on it called "numerical analysis".
The vector elements must be float, double, or long double. The inner product calculation will be slower than if the elements were ints.
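Since the question is about timing an inner product in C, a plain float version looks something like the sketch below (my own code, not the answer's); each iteration is one multiply plus one add, i.e. two floating point operations.

    #include <stddef.h>

    /* Dot product of two float vectors: n multiplies + n adds = 2n floating point ops. */
    float inner_product(const float *x, const float *y, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }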
That would be 100 floating point adds. (That is, unless the compiler realizes nothing is ever done with the result and optimizes the whole thing away.)
Computers use a variety of internal formats to represent floating point numbers. In the example you mention, the CPU would convert the 32-bit float into its internal 128-bit format before doing operations on the number.
In addition to uses other answers have mentioned, people called "quants" use floating point math for finance these days. A guy named David E. Shaw started applying floating point math to modeling Wall Street in 1988, and as of Sept. 30, 2009, is worth $2.5 billion and ranks #123 on the Forbes list of the 400 richest Americans.
So it's worth learning a bit about floating point math!
1) Floating point is important because sometimes we want to represent really big or really small numbers, and integers aren't really so good at that. Read up on the IEEE-754 standard, but the mantissa is like the integer portion, and we trade some bits to work as an exponent, which allows a much wider range of numbers to be represented.
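To see that split for yourself, here is a small C sketch of my own (not part of the answer) that pulls apart the 1 sign bit, 8 exponent bits, and 23 fraction bits of a single-precision float:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = 3.14159f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);            /* reinterpret the float's bits */

        unsigned sign     = bits >> 31;            /* 1 bit */
        unsigned exponent = (bits >> 23) & 0xFFu;  /* 8 bits, biased by 127 */
        unsigned fraction = bits & 0x7FFFFFu;      /* 23 bits */

        printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
               sign, exponent, (int)exponent - 127, fraction);
        return 0;
    }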
2) If the two vectors are ints, you won't measure FLOPS. If one vector is int and another is float, you'll be doing lots of int->float conversions, and we should probably consider such a conversion to be a FLOP.
3/4) Floating point operations on Intel architectures are really quite exotic. It's actually a stack-based, single-operand instruction set (usually). For instance, in your example, you would use one instruction whose opcode loads a memory operand onto the top of the FPU stack, then another instruction whose opcode adds a memory operand to the top of the FPU stack, and finally another instruction whose opcode pops the top of the FPU stack out to the memory operand.
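As a sketch of what that looks like in practice (my own example; it assumes a compiler targeting the legacy x87 FPU, e.g. gcc -m32 -mfpmath=387), a simple addition compiles to roughly the load/add/pop sequence just described:

    /* For c = a + b, the x87 code is typically something like:
           fld  a    ; push a onto the top of the FPU stack
           fadd b    ; add the memory operand b to the stack top
           fstp c    ; pop the stack top back out to memory c   */
    void add_floats(const float *a, const float *b, float *c)
    {
        *c = *a + *b;
    }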
This website lists a lot of the operations.
http://www.website.masmforum.com/tutorials/fptute/appen1.htm
I'm sure Intel publishes the actual opcodes somewhere, if you're really that interested.