What is the difference between "asm", "__asm" and "__asm__"?

Posted on 2024-09-11 08:48:26

As far as I can tell, the only difference between __asm { ... }; and __asm__("..."); is that the first uses mov eax, var and the second uses movl %0, %%eax with :"=r" (var) at the end. What other differences are there? And what about just asm?

Comments (4)

久夏青 2024-09-18 08:48:26

There's a massive difference between MSVC inline asm and GNU C inline asm. GCC syntax is designed for optimal output without wasted instructions, for wrapping a single instruction or something. MSVC syntax is designed to be fairly simple, but AFAICT it's impossible to use without the latency and extra instructions of a round trip through memory for your inputs and outputs.

(MSVC __asm{ ... } syntax is also supported by clang -fasm-blocks, with one difference: MSVC supports leaving a value in EAX and falling off the end of a non-void function; clang -fasm-blocks doesn't. Presumably clang-cl doesn't either.)

If you're using inline asm for performance reasons, this makes MSVC inline asm only viable if you write a whole loop entirely in asm, not for wrapping short sequences in an inline function. The example below (wrapping idiv with a function) is the kind of thing MSVC is bad at: ~8 extra store/load instructions.

MSVC inline asm (used by MSVC and probably icc, maybe also available in some commercial compilers):

  • looks at your asm to figure out which registers your code steps on.
  • can only transfer data via memory. Data that was live in registers is stored by the compiler to prepare for your mov ecx, shift_count, for example. So using a single asm instruction that the compiler won't generate for you involves a round-trip through memory on the way in and on the way out.
  • more beginner-friendly, but often impossible to avoid overhead getting data in/out. Even besides the syntax limitations, the optimizer in current versions of MSVC isn't good at optimizing around inline asm blocks, either.

GNU C inline asm is not a good way to learn asm. You have to understand asm very well so you can tell the compiler about your code. And you have to understand what compilers need to know. That answer also has links to other inline-asm guides and Q&As. The tag wiki has lots of good stuff for asm in general, but just links to that for GNU inline asm. (The stuff in that answer is applicable to GNU inline asm on non-x86 platforms, too.)

GNU C inline asm syntax is used by gcc, clang, icc, and maybe some commercial compilers which implement GNU C:

  • You have to tell the compiler what you clobber. Failure to do this will lead to breakage of surrounding code in non-obvious hard-to-debug ways.

  • Powerful but hard to read, learn, and use syntax for telling the compiler how to supply inputs, and where to find outputs. e.g. "c" (shift_count) will get the compiler to put the shift_count variable into ecx before your inline asm runs.

  • extra clunky for large blocks of code, because the asm has to be inside a string constant. So you typically need

      "insn   %[inputvar], %%reg\n\t"       // comment
      "insn2  %%reg, %[outputvar]\n\t"
    
  • very unforgiving / harder, but allows lower overhead esp. for wrapping single instructions. (wrapping single instructions was the original design intent, which is why you have to specially tell the compiler about early clobbers to stop it from using the same register for an input and output if that's a problem.)
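The constraint and clobber points above can be sketched with a one-instruction wrapper. This is only an illustration, not code from the answer: the helper name rotl32 is made up, the asm path assumes a GNU C compiler targeting x86, and other targets take the plain C fallback.

```c
#include <stdint.h>

/* Hypothetical helper: rotate x left by count bits.
   Assumes GNU C on x86 for the asm path; anything else uses plain C. */
static inline uint32_t rotl32(uint32_t x, unsigned count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("roll %%cl, %0"
             : "+r" (x)       /* read-write operand in any general register */
             : "c" (count)    /* input forced into ecx, so %cl holds the count */
             : "cc");         /* tell the compiler that flags are modified */
    return x;
#else
    count &= 31;              /* match the hardware, which masks to 5 bits */
    return (x << count) | (x >> ((32 - count) & 31));
#endif
}
```

For example, rotl32(0x80000001u, 1) gives 0x00000003. Because the operands are described through constraints, the compiler can keep x in whatever register it already lives in, with no store/reload around the instruction.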


Example: full-width integer division (div)

On a 32bit CPU, dividing a 64bit integer by a 32bit integer, or doing a full-multiply (32x32->64), can benefit from inline asm. gcc and clang don't take advantage of idiv for (int64_t)a / (int32_t)b, probably because the instruction faults if the result doesn't fit in a 32bit register. So unlike this Q&A about getting quotient and remainder from one div, this is a use-case for inline asm. (Unless there's a way to inform the compiler that the result will fit, so idiv won't fault.)

We'll use calling conventions that put some args in registers (with hi even in the right register), to show a situation that's closer to what you'd see when inlining a tiny function like this.


MSVC

Be careful with register-arg calling conventions when using inline-asm. Apparently the inline-asm support is so badly designed/implemented that the compiler might not save/restore arg registers around the inline asm, if those args aren't used in the inline asm. Thanks @RossRidge for pointing this out.

// MSVC.  Be careful with _vectorcall & inline-asm: see above
// we could return a struct, but that would complicate things
int _vectorcall div64(int hi, int lo, int divisor, int *premainder) {
    int quotient, tmp;
    __asm {
        mov   edx, hi;
        mov   eax, lo;
        idiv   divisor
        mov   quotient, eax
        mov   tmp, edx;
        // mov ecx, premainder   // Or this I guess?
        // mov   [ecx], edx
    }
    *premainder = tmp;
    return quotient;     // or omit the return with a value in eax
}

Update: apparently leaving a value in eax or edx:eax and then falling off the end of a non-void function (without a return) is supported, even when inlining. I assume this works only if there's no code after the asm statement. See Does __asm{}; return the value of eax? This avoids the store/reloads for the output (at least for quotient), but we can't do anything about the inputs. In a non-inline function with stack args, they will be in memory already, but in this use-case we're writing a tiny function that could usefully inline.


Compiled with MSVC 19.00.23026 /O2 on rextester (with a main() that finds the directory of the exe and dumps the compiler's asm output to stdout).

## My added comments are marked with ##
; ... define some symbolic constants for stack offsets of parameters
; 48   : int ABI div64(int hi, int lo, int divisor, int *premainder) {
    sub esp, 16                 ; 00000010H
    mov DWORD PTR _lo$[esp+16], edx      ## these symbolic constants match up with the names of the stack args and locals
    mov DWORD PTR _hi$[esp+16], ecx

    ## start of __asm {
    mov edx, DWORD PTR _hi$[esp+16]
    mov eax, DWORD PTR _lo$[esp+16]
    idiv    DWORD PTR _divisor$[esp+12]
    mov DWORD PTR _quotient$[esp+16], eax  ## store to a local temporary, not *premainder
    mov DWORD PTR _tmp$[esp+16], edx
    ## end of __asm block

    mov ecx, DWORD PTR _premainder$[esp+12]
    mov eax, DWORD PTR _tmp$[esp+16]
    mov DWORD PTR [ecx], eax               ## I guess we should have done this inside the inline asm so this would suck slightly less
    mov eax, DWORD PTR _quotient$[esp+16]  ## but this one is unavoidable
    add esp, 16                 ; 00000010H
    ret 8

There's a ton of extra mov instructions, and the compiler doesn't even come close to optimizing any of it away. I thought maybe it would see and understand the mov tmp, edx inside the inline asm, and make that a store to premainder. But that would require loading premainder from the stack into a register before the inline asm block, I guess.

This function is actually worse with _vectorcall than with the normal everything-on-the-stack ABI. With two inputs in registers, it stores them to memory so the inline asm can load them from named variables. If this were inlined, even more of the parameters could potentially be in the regs, and it would have to store them all, so the asm would have memory operands! So unlike gcc, we don't gain much from inlining this.

Doing *premainder = tmp inside the asm block means more code written in asm, but does avoid the totally braindead store/load/store path for the remainder. This reduces the instruction count by 2 total, down to 11 (not including the ret).

I'm trying to get the best possible code out of MSVC, not "use it wrong" and create a straw-man argument. But AFAICT it's horrible for wrapping very short sequences. Presumably there's an intrinsic function for 64/32 -> 32 division that allows the compiler to generate good code for this particular case, so the entire premise of using inline asm for this on MSVC could be a straw-man argument. But it does show you that intrinsics are much better than inline asm for MSVC.


GNU C (gcc/clang/icc)

Gcc does even better than the output shown here when inlining div64, because it can typically arrange for the preceding code to generate the 64bit integer in edx:eax in the first place.

I can't get gcc to compile for the 32bit vectorcall ABI. Clang can, but it sucks at inline asm with "rm" constraints (try it on the godbolt link: it bounces the function arg through memory instead of using the register option in the constraint). The 64bit MS calling convention is close to the 32bit vectorcall, with the first two params in ecx, edx. The difference is that 2 more params go in regs before using the stack (and that the callee doesn't pop the args off the stack, which is what the ret 8 was about in the MSVC output.)

// GNU C
// change everything to int64_t to do 128b/64b -> 64b division
// MSVC doesn't do x86-64 inline asm, so we'll use 32bit to be comparable
int div64(int lo, int hi, int *premainder, int divisor) {
    int quotient, rem;
    asm ("idivl  %[divsrc]"
          : "=a" (quotient), "=d" (rem)    // a means eax,  d means edx
          : "d" (hi), "a" (lo),
            [divsrc] "rm" (divisor)        // Could have just used %0 instead of naming divsrc
            // note the "rm" to allow the src to be in a register or not, whatever gcc chooses.
            // "rmi" would also allow an immediate, but unlike adc, idiv doesn't have an immediate form
          : // no clobbers
        );
    *premainder = rem;
    return quotient;
}

Compiled with gcc -m64 -O3 -mabi=ms -fverbose-asm. With -m32 you just get 3 loads, idiv, and a store, as you can see from changing stuff in that godbolt link.

mov     eax, ecx  # lo, lo
idivl  r9d      # divisor
mov     DWORD PTR [r8], edx       # *premainder_7(D), rem
ret

For 32bit vectorcall, gcc would do something like

## Not real compiler output, but probably similar to what you'd get
mov     eax, ecx               # lo, lo
mov     ecx, [esp+12]          # premainder
idivl   [esp+16]               # divisor
mov     DWORD PTR [ecx], edx   # *premainder_7(D), rem
ret   8

MSVC uses 13 instructions (not including the ret), compared to gcc's 4. With inlining, as I said, it potentially compiles to just one, while MSVC would still use probably 9. (It won't need to reserve stack space or load premainder; I'm assuming it still has to store about 2 of the 3 inputs. Then it reloads them inside the asm, runs idiv, stores two outputs, and reloads them outside the asm. So that's 4 loads/stores for input, and another 4 for output.)
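The comment in the GNU C version mentions changing everything to int64_t to get 128b/64b -> 64b division. A rough sketch of that variant follows. The name div128_64 is hypothetical, the asm path assumes GNU C on x86-64, and the fallback assumes the compiler provides __int128. As with the 32-bit version, the caller is responsible for making sure the quotient fits, otherwise idiv faults.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: divide the signed 128-bit value hi:lo by divisor.
   The caller must guarantee divisor != 0 and that the quotient fits in
   64 bits; otherwise idiv raises a divide error. */
static inline int64_t div128_64(int64_t hi, uint64_t lo, int64_t divisor,
                                int64_t *premainder)
{
    assert(divisor != 0);
#if defined(__GNUC__) && defined(__x86_64__)
    int64_t quotient, rem;
    __asm__ ("idivq %[divsrc]"
             : "=a" (quotient), "=d" (rem)   /* outputs in rax and rdx */
             : "d" (hi), "a" (lo),           /* dividend in rdx:rax */
               [divsrc] "rm" (divisor)       /* register or memory operand */
             : "cc");
    *premainder = rem;
    return quotient;
#elif defined(__SIZEOF_INT128__)
    /* Portable fallback for other targets that have __int128. */
    __int128 n = (__int128)(((unsigned __int128)(uint64_t)hi << 64) | lo);
    *premainder = (int64_t)(n % divisor);
    return (int64_t)(n / divisor);
#else
#error "this sketch needs GNU C on x86-64 or a compiler with __int128"
#endif
}
```

Calling div128_64(0, 100, 7, &r) returns 14 and stores 2 in r. For a negative dividend, hi has to be the sign extension of the value, for example hi = -1 with lo = (uint64_t)-100.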

又怨 2024-09-18 08:48:26

Which one you use depends on your compiler. This isn't standard like the C language.

呢古 2024-09-18 08:48:26

asm vs __asm__ in GCC

asm does not work with -std=c99, you have two alternatives:

  • use __asm__
  • use -std=gnu99

More details: error: ‘asm’ undeclared (first use in this function)

__asm vs __asm__ in GCC

I could not find where __asm is documented (notably not mentioned at https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/Alternate-Keywords.html#Alternate-Keywords ), but from the GCC 8.1 source they are exactly the same:

  { "__asm",        RID_ASM,    0 },
  { "__asm__",      RID_ASM,    0 },

so I would just use __asm__ which is documented.
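As a small illustration of the first alternative (a sketch only; the function name load_after_barrier is made up, and a GNU C compatible compiler is assumed), the following should build with both -std=c99 and -std=gnu99, because it spells the keyword as __asm__:

```c
/* Hypothetical helper: read *p after a compiler-level memory barrier.
   Uses __asm__ rather than asm so it also builds under -std=c99. */
static inline int load_after_barrier(const int *p)
{
    /* Empty template with a "memory" clobber: no instruction is emitted,
       but the compiler may not cache or reorder memory accesses across it. */
    __asm__ __volatile__ ("" : : : "memory");
    return *p;
}
```

Replacing __asm__ with plain asm here would be expected to fail under -std=c99 with the "undeclared" error mentioned above, while still building under -std=gnu99.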

梦旅人picnic 2024-09-18 08:48:26

With GCC, there isn't much difference. asm, __asm, and __asm__ all mean the same thing; the underscore variants exist only to avoid namespace conflicts (for example, a user-defined function named asm).
