MSVC Inline Assembly: Freeing FPU Registers for Performance

Posted on 2025-02-10 21:22:35


While playing a little with FPU using MSVC's Inline Assembly, I got a little confused about freeing FPU registers in favor of increasing performance...

For example:

#include <stdio.h>

double fpu_add(register double x, register double y) {
    double res = 0.0;

    __asm {
        fld x
        fld y
        fadd
        fstp res
    }

    return res;
}

int main(void) {
    double x = fpu_add(5.0, 2.0);
    (void) printf("x = %f\n", x);
    
    return 0;
}

When do I have to ffree the FPU registers in Inline Assembly?

In that example would performance be better if I decided to ffree the st(1) register?

Also is fstp a shorthand for instructions below?

__asm {
    fst res
    ffree st(0)
}

NOTE: I know FPU instructions are a bit old nowadays, but I'm treating them as another option alongside SSE.

在风中等你 2025-02-17 21:22:36

The ffree instruction lets you mark any slot of the x87 FP stack as free (empty) without changing the stack pointer. So ffree st(0) does NOT pop the stack; it just tags the top-of-stack register as empty, so any following instruction that tries to read it gets an x87 invalid-operation (stack underflow) exception, or a NaN result if that exception is masked, as it is by default.

To actually pop the stack you need both ffree st(0) and fincstp (to increment the stack pointer). Or better, use fstp st(0), which does both of those things with a single cheap instruction. Or use fstp st(1) to keep the top-of-stack value and discard the old st(1).
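
The difference between these instructions can be sketched with a toy model of the register stack in plain C. This is only an illustration, not real hardware behavior in every detail: the X87Model struct and the model_* helper names are hypothetical, and the model tracks just the eight slots, the TOP pointer, and an empty/valid tag per slot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of the x87 register stack (illustration only). */
typedef struct {
    double   reg[8];    /* physical registers */
    bool     empty[8];  /* true = slot is tagged empty (free) */
    unsigned top;       /* physical index of st(0) */
} X87Model;

static void model_init(X87Model *f) {
    for (unsigned i = 0; i < 8; i++) {
        f->reg[i] = 0.0;
        f->empty[i] = true;
    }
    f->top = 0;
}

/* Map st(i) to a physical slot index. */
static unsigned model_phys(const X87Model *f, unsigned i) {
    assert(i < 8);
    return (f->top + i) & 7u;
}

/* fld: decrement TOP, then write the new st(0). */
static void model_fld(X87Model *f, double v) {
    f->top = (f->top - 1u) & 7u;
    f->reg[f->top] = v;
    f->empty[f->top] = false;
}

/* ffree st(i): tag the slot empty; TOP does not move. */
static void model_ffree(X87Model *f, unsigned i) {
    f->empty[model_phys(f, i)] = true;
}

/* fincstp: increment TOP only; tags are left alone. */
static void model_fincstp(X87Model *f) {
    f->top = (f->top + 1u) & 7u;
}

/* fstp st(i): copy st(0) into st(i), then pop (tag old st(0) empty, TOP++). */
static void model_fstp_st(X87Model *f, unsigned i) {
    unsigned dst = model_phys(f, i);
    f->reg[dst] = f->reg[f->top];
    f->empty[dst] = false;
    f->empty[f->top] = true;   /* for i == 0 this wins: the slot ends up empty */
    f->top = (f->top + 1u) & 7u;
}

/* Number of slots currently tagged valid. */
static unsigned model_depth(const X87Model *f) {
    unsigned n = 0;
    for (unsigned i = 0; i < 8; i++)
        if (!f->empty[i]) n++;
    return n;
}
```

In this model, after two model_fld calls, model_ffree(&f, 0) lowers the depth to one but leaves top where it was, so st(0) is an empty slot. Only the following model_fincstp, or a single model_fstp_st(&f, 0) instead of the pair, makes the older value the new st(0).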

But it's usually even better and easier (and faster) to use the p-suffixed (popping) versions of other instructions. In your case, you probably want

__asm {
    fld x     // push x on the stack
    fld y     // push y on the stack
    faddp     // pop a value and add it to the (now) tos
    fstp res  // pop and store tos
}

This ends up pushing and popping two values, leaving the FP stack in the same state as it was before. Leaving stuff on the FP stack should be avoided: if the compiler is generating x87 code, it is likely to cause problems for other FP code.

Or even better, use fadd with a memory source operand to save an instruction, if you're optimizing for CPUs where that's not slower. (Check Agner Fog's microarch PDF and instruction tables for P5 Pentium and newer: it seems to be fine, at least break-even, and it saves a uop on more modern CPUs like Core2 that can micro-fuse memory source operands.)

    __asm {
        fld x     // push x on the stack
        fadd y    // ST0 += y
        fstp res  // pop and store tos
    }

But MSVC inline asm is inherently slow for wrapping a single instruction like fadd: it forces the inputs to be in memory, even if the compiler had them available in registers before the asm statement. It also forces the result to be stored inside the asm block and then reloaded for the return statement, unless you use a hack like leaving the value in st(0) and falling off the end of the function without a return statement. (MSVC does actually support this even when inlining, but clang-cl / clang -fasm-blocks does not.)

GNU C inline asm could wrap a single fadd instruction with appropriate constraints to ask for inputs in x87 registers and tell the compiler where the output is (in st(0)), but you'd still have to choose between fadd and faddp, not letting the compiler pick based on whether it had values in registers or a value from memory. (https://stackoverflow.com/tags/inline-assembly/info)
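
As a rough sketch of what that could look like, the function below wraps faddp using the x87 constraints "t" (st(0)) and "u" (st(1)), following the operand pattern the GCC manual shows for popping x87 instructions. The function name fpu_add_gnu is hypothetical, the choice of faddp over fadd is just one of the two options mentioned above, and the plain C fallback exists only so the file still builds on non-x86 targets or non-GNU compilers.

```c
/* Sketch only: GNU C extended asm (GCC/clang), not MSVC __asm syntax. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
static double fpu_add_gnu(double x, double y) {
    double res;
    /* "0" ties x to the output register st(0); "u" asks for y in st(1).
       faddp computes st(1) += st(0) and pops once, so the sum ends up in
       the new st(0) and the old st(1) slot is declared clobbered. */
    __asm__("faddp" : "=t"(res) : "0"(x), "u"(y) : "st(1)");
    return res;
}
#else
/* Fallback for other targets: let the compiler do the addition. */
static double fpu_add_gnu(double x, double y) {
    return x + y;
}
#endif
```

Because the operands are described with constraints, the compiler does not have to store x and y to memory first, and it knows the result is in st(0) rather than in a memory variable.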

Compilers aren't terrible; they will make code at least this good from plain C source. Inline asm is generally not useful for performance, unless you're writing a whole loop that's carefully tuned for a specific CPU, or handling a case where the compiler does a poor job with something. (Look at the compiler's optimized asm output, e.g. on https://godbolt.org/)
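
For comparison, the plain C version is a one-liner. The name add_plain is hypothetical. To see x87 code for it rather than SSE2, you would probably need a 32-bit build with something like MSVC's /arch:IA32 or GCC's -mfpmath=387; treat those flags as a starting point to check against your compiler's documentation.

```c
/* Plain C: the compiler chooses the instructions and keeps values in
   registers across inlining, with no forced store/reload. */
double add_plain(double x, double y) {
    return x + y;
}
```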
