MSVC Inline Assembly: Freeing FPU registers for performance
While playing around with the FPU using MSVC's inline assembly, I got a little confused about freeing FPU registers to improve performance...
For example:
#include <stdio.h>
double fpu_add(register double x, register double y) {
    double res = 0.0;
    __asm {
        fld x
        fld y
        fadd
        fstp res
    }
    return res;
}
int main(void) {
    double x = fpu_add(5.0, 2.0);
    (void) printf("x = %f\n", x);
    return 0;
}
When do I have to ffree the FPU registers in inline assembly?
In that example, would performance be better if I decided to ffree the st(1) register?
Also, is fstp a shorthand for the instructions below?
__asm {
    fst res
    ffree st(0)
}
NOTE: I know FPU instructions are a bit old nowadays, but I'm dealing with them as another option along with SSE.
Comments (1)
The ffree instruction lets you mark any slot of the x87 FP stack as free without actually changing the stack pointer. So ffree st(0) does NOT pop the stack; it just marks the top value of the stack as free/invalid, so any following instruction that tries to access it will get a floating point exception (an invalid-operation exception, or a NaN result if that exception is masked).
To actually pop the stack you need both ffree st(0) and fincstp (to increment the pointer). Or better, use fstp st(0) to do both of those things with a single cheap instruction. Or use fstp st(1) to keep the top-of-stack value and discard the old st(1).
But it's usually even better and easier (and faster) to use the p-suffixed versions of other instructions. In your case, you probably want faddp followed by fstp res after the two fld loads. This ends up pushing and popping two values, leaving the FP stack in the same state as it was before. Leaving stuff on the FP stack is likely to cause problems with other FP code if the compiler is generating x87 FP code, so it should be avoided.
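The difference between ffree, fincstp, and the popping forms is easier to see in a toy model of the register stack. The sketch below is plain C, not inline assembly and not an emulator. X87Model and every function name in it are hypothetical, and it only tracks the TOP index and the per-slot "in use" tags, ignoring exceptions, precision control, and everything else the real FPU does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the x87 register stack (NOT an emulator): eight slots,
   a TOP index, and one "in use" tag per slot. All names are hypothetical. */
typedef struct {
    double reg[8];
    bool   used[8];  /* false means the slot is tagged empty */
    int    top;      /* physical index that st(0) maps to */
} X87Model;

/* Map the stack-relative index st(i) to a physical slot. */
int phys(const X87Model *f, int i) { return (f->top + i) & 7; }

X87Model x87_new(void) {
    X87Model f = {{0}, {false}, 0};
    return f;
}

/* Number of slots currently tagged as in use. */
int x87_depth(const X87Model *f) {
    int n = 0;
    for (int i = 0; i < 8; i++) n += f->used[i];
    return n;
}

/* fld: decrement TOP, then write the new st(0). */
void x87_fld(X87Model *f, double v) {
    f->top = (f->top + 7) & 7;
    assert(!f->used[f->top]);  /* real hardware would flag a stack overflow */
    f->reg[f->top] = v;
    f->used[f->top] = true;
}

/* ffree st(i): tag the slot empty; TOP is unchanged. */
void x87_ffree(X87Model *f, int i) { f->used[phys(f, i)] = false; }

/* fincstp: increment TOP; tags are unchanged. */
void x87_fincstp(X87Model *f) { f->top = (f->top + 1) & 7; }

/* Pop: tag st(0) empty and increment TOP (the last step of the p forms). */
void x87_pop(X87Model *f) { x87_ffree(f, 0); x87_fincstp(f); }

/* fstp st(i): copy st(0) to st(i), then pop. */
void x87_fstp_st(X87Model *f, int i) {
    int dst = phys(f, i);
    f->reg[dst] = f->reg[f->top];
    f->used[dst] = true;
    x87_pop(f);
}

/* faddp: st(1) += st(0), then pop. */
void x87_faddp(X87Model *f) {
    f->reg[phys(f, 1)] += f->reg[f->top];
    x87_pop(f);
}

/* fstp to memory: store st(0), then pop. */
void x87_fstp_mem(X87Model *f, double *out) {
    *out = f->reg[f->top];
    x87_pop(f);
}

/* Models fld x / fld y / faddp / fstp res: two pushes, two pops. */
double model_add(X87Model *f, double x, double y) {
    double res;
    x87_fld(f, x);
    x87_fld(f, y);
    x87_faddp(f);
    x87_fstp_mem(f, &res);
    return res;
}
```

On a fresh model, model_add(&f, 5.0, 2.0) returns 7.0 with x87_depth back at 0 and top back where it started. By contrast, x87_fld followed only by x87_ffree(&f, 0) leaves the depth at 0 but top one slot lower, which is the state that still needs fincstp to finish the pop.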
Or even better, use memory-source fadd to save instructions, if you're optimizing for CPUs where that's not slower. (Check Agner Fog's microarch PDF and instruction tables for P5 Pentium and newer: it seems to be fine, at least break-even, and it saves a uop on more modern CPUs like Core2 that can micro-fuse memory source operands.)
But MSVC inline asm is inherently slow for wrapping a single instruction like fadd: it forces inputs to be in memory, even if the compiler had them available in registers before the asm statement. It also forces the result to be stored in the asm and then reloaded for the return statement, unless you use a hack like leaving a value in st(0) and falling off the end of a function without a return statement. (MSVC does actually support this even when inlining, but clang-cl / clang -fasm-blocks does not.)
GNU C inline asm could wrap a single fadd instruction with appropriate constraints to ask for inputs in x87 registers and tell the compiler where the output is (in st(0)), but you'd still have to choose between fadd and faddp, not letting the compiler pick based on whether it had values in registers or a value from memory. (https://stackoverflow.com/tags/inline-assembly/info)
Compilers aren't terrible; they will make code at least this good from plain C source. Inline asm is generally not useful for performance, unless you're writing a whole loop that's carefully tuned for a specific CPU, or for a case where the compiler does a poor job with something. (Look at the compiler's optimized asm output, e.g. on https://godbolt.org/)
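As a rough illustration of the memory-source and GNU C points above, here is a sketch of one wrapper with three bodies selected by the preprocessor. The name x87_add is hypothetical. The GNU C branch assumes GCC on x86 and follows the operand pattern the GCC manual uses for x87 asm ("=t" for an output in st(0), "0" and "u" for inputs in st(0) and st(1), and a st(1) clobber because faddp pops that input). The MSVC branch assumes a 32-bit x86 build, since __asm blocks are not available when targeting x64. Everything else, including clang, falls back to plain C in this sketch.

```c
/* Hypothetical helper: add two doubles on the x87 stack where the toolchain
   allows it, otherwise leave the choice of instructions to the compiler. */
double x87_add(double x, double y) {
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__i386__) || defined(__x86_64__))
    double res;
    /* The compiler loads x into st(0) and y into st(1). faddp adds them,
       pops once, and leaves the sum in st(0), which "=t" names as output. */
    __asm__("faddp" : "=t"(res) : "0"(x), "u"(y) : "st(1)");
    return res;
#elif defined(_MSC_VER) && defined(_M_IX86)
    double res;
    __asm {
        fld  x      /* push x */
        fadd y      /* memory-source form: st(0) += y, no second push */
        fstp res    /* store and pop, so the stack ends up balanced */
    }
    return res;
#else
    /* Plain C: the compiler picks the instructions itself. */
    return x + y;
#endif
}
```

All three bodies compute the same sum, so x87_add(5.0, 2.0) gives 7.0 whichever branch is compiled. Comparing the optimized output of each branch against the plain C version is a quick way to check the claim that inline asm for a single instruction rarely helps.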