Benefits of x87 over SSE
I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse automatically in any project, and I wonder if I'm missing anything else that the x87 FPU offers.
For hand-written asm, x87 has some instructions that don't exist in the SSE instruction set.
Off the top of my head, it's all transcendental stuff: trigonometric instructions like fsin, fcos, fsincos, fptan and fpatan, plus exponential/logarithm instructions such as f2xm1, fyl2x and fyl2xp1.
With gcc -O3 -ffast-math -mfpmath=387, GCC 9 will still actually inline sin(x) as an fsin instruction, regardless of what the implementation in libm would have used (https://godbolt.org/z/Euc5gp). MSVC calls __libm_sse2_sin_precise when compiling for 32-bit x86.
If your code spends most of its time doing trigonometry, you may see a slight performance gain or loss if you use x87, depending on whether your standard math-library implementation using SSE1/SSE2 is faster or slower than the slow microcode for fsin on whatever CPU you're using.
CPU vendors don't put a lot of effort into optimizing the microcode for x87 instructions in the newest generations of CPUs because it's generally considered obsolete and rarely used. (Look at uop counts and throughput for complex x87 instructions in Agner Fog's instruction tables in recent generations of CPUs: more cycles than in older CPUs.) The newer the CPU, the more likely x87 will be slower than many SSE or AVX instructions to compute log, exp, pow, or trig functions.
Even when x87 is available, not all math libraries choose to use complex instructions like fsin for implementing functions like sin(), or especially exp/log, where integer tricks for manipulating the log-based FP bit-patterns are useful.
Some DSP algorithms use a lot of trig, but typically benefit a lot from auto-vectorization with SIMD math libraries.
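To illustrate the kind of integer trick meant for exp/log above, here is a minimal sketch in C. It assumes IEEE 754 binary64 doubles, and the helper name ilog2_double is made up for this example; it is not taken from any particular math library. It reads floor(log2(x)) directly out of the exponent field, which is typically the first step of a software log implementation before a polynomial handles the mantissa.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: floor(log2(x)) for a positive, normal, finite double.
   Assumes IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 mantissa bits). */
int ilog2_double(double x)
{
    uint64_t bits;

    assert(x > 0.0);                          /* precondition: positive input only */
    memcpy(&bits, &x, sizeof bits);           /* well-defined way to read the bit pattern */
    int biased = (int)((bits >> 52) & 0x7FF); /* extract the 11-bit exponent field */
    return biased - 1023;                     /* remove the exponent bias */
}
```

For example, ilog2_double(8.0) yields 3 and ilog2_double(0.75) yields -1. Subnormals, infinities and NaNs are deliberately not handled in this sketch.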
However, for math code where you spend most of your time doing additions, multiplications, etc., SSE is usually faster.
Also related: Intel Underestimates Error Bounds by 1.3 quintillion. The worst case for fsin (catastrophic cancellation for fsin inputs very near pi) is very bad. Software can do better, but only with slow extended-precision techniques.
x87 FPU instructions encode in fewer bytes than SSE instructions, so they are ideal for size-constrained demoscene stuff.
There is considerable legacy and small system compatibility with the x87: SSE is a relatively new processor feature. If your code is to run on an embedded microcontroller, there's a good chance it won't support SSE instructions.
Even systems which don't have an FPU installed will often provide 80x87 emulators which will make the code run transparently (more or less). I don't know of any SSE emulators; certainly one of my systems doesn't have any, so the newest Adobe Photoshop Elements versions refuse to run.
The 80x87 instructions have good parallel operation characteristics which have been thoroughly explored and analyzed since their introduction in 1982 or so. Various clones of the x86 might stall on SSE instructions.
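If the same source has to build for targets with and without SSE, the choice can at least be made visible at compile time. The following is a rough sketch, not a complete detection scheme: it relies on the __SSE2__ macro that GCC and Clang define when SSE2 code generation is enabled, and on MSVC's _M_X64 and _M_IX86_FP macros. The enum and function names are hypothetical. Note that it reports what the compiler was allowed to emit, not what the CPU running the binary supports.

```c
#include <assert.h>

/* Hypothetical classification of the FP instruction set this file was compiled for. */
enum fp_isa { FP_ISA_SSE2, FP_ISA_X87, FP_ISA_OTHER };

enum fp_isa compiled_fp_isa(void)
{
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return FP_ISA_SSE2;   /* SSE2 code generation is enabled */
#elif defined(__i386__) || defined(_M_IX86)
    return FP_ISA_X87;    /* 32-bit x86 without SSE2: scalar math goes through x87 */
#else
    return FP_ISA_OTHER;  /* not an x86 target at all */
#endif
}

const char *fp_isa_name(enum fp_isa isa)
{
    static const char *const names[] = { "sse2", "x87", "non-x86" };

    assert((unsigned)isa < 3u);  /* guard the table lookup */
    return names[isa];
}
```

A build script or startup log could print fp_isa_name(compiled_fp_isa()) to confirm which path a given binary was built for.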
Conversion between float and double is faster with x87 (usually free) than with SSE. With x87, you can load and store a float, double or long double to or from the register stack, and it is converted to or from extended precision without extra cost. With SSE, additional instructions are required to do the type conversion if types are mixed, because the registers contain float or double values. These conversion instructions are fairly fast but do take extra time.
The real fix is to refrain from mixing float and double excessively, not to use x87, of course.
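To make the float/double conversion cost concrete, here is a small C sketch with two hypothetical functions. In the first, every float element is widened to double before the add; with SSE math a compiler will typically emit a conversion instruction such as cvtss2sd for that, whereas an x87 load widens as part of the load itself. The second keeps a single type throughout, which is the fix the answer recommends. The exact instructions depend on the compiler and flags, so treat the comments as an expectation rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Mixed types: each float is implicitly converted to double before the add. */
double sum_as_double(const float *xs, size_t n)
{
    assert(xs != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += xs[i];            /* implicit float -> double conversion per element */
    return acc;
}

/* Single type: no conversion is needed inside the loop. */
float sum_as_float(const float *xs, size_t n)
{
    assert(xs != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += xs[i];            /* float + float, stays in one format */
    return acc;
}
```

The double accumulator is more accurate for long inputs, so the choice is a trade-off between precision and the per-element conversion; comparing the generated assembly for both functions shows the difference directly.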
x87 supports 80-bit floating point math, which is considered obsolete today in favor of 64-bit floating point math. Some compilers will still let you use it with the long double type, but others (such as MSVC) treat long double as an ordinary 64-bit double.
If you have a real need for 80-bit floating point math (for example, working with old binary data in that format, or emulating an x87 math coprocessor), make sure you are using a suitable compiler which supports it.
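One way to check whether a given compiler maps long double to the 80-bit extended format is to look at the limits in <float.h>. The sketch below uses a hypothetical function name and assumes that a 64-bit significand together with a maximum exponent of 16384 identifies the x87 extended layout; a few non-x86 formats share those parameters, so it is a heuristic, not a proof.

```c
#include <assert.h>
#include <float.h>

/* Returns 1 if long double has the significand width and exponent range
   of the x87 80-bit extended format, 0 otherwise (for example when
   long double is the same as double). */
int long_double_is_x87_extended(void)
{
    /* C requires long double to be at least as precise as double. */
    assert(LDBL_MANT_DIG >= DBL_MANT_DIG);
    return LDBL_MANT_DIG == 64 && LDBL_MAX_EXP == 16384;
}
```

With GCC or Clang on x86 Linux this is expected to return 1; with a compiler that treats long double as double it returns 0, and code that depends on 80-bit data should refuse to build or fall back to a software implementation.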