Why does GCC subtract 1 and compare <= 2? Is cmp faster with powers of two in assembly?

Posted on 2025-02-04 00:36:41

I was writing some code to clear the screen to a particular color. C++ code:

void clear_screen(unsigned int color, void *memory, int height, int width) {
  unsigned int *pixel = (unsigned int *)memory;
  for (auto y = 0; y < height; y++)
    for (auto x = 0; x < width; x++)
      *pixel++ = color;
}

I used g++ and objconv to generate the corresponding assembly. This is what I got, and I've commented what I think some of the lines do too.

renderer_clear_screen:
        push    r13                                     
        push    r12                                     
        push    rbp                                     
        push    rdi                                     
        push    rsi                                     
        push    rbx                                     
        mov     r11d, ecx            ; move the color into r11d
        mov     ebx, r8d             ; move the height into ebx
        mov     rcx, rdx             ; 000E _ 48: 89. D1
        test    r8d, r8d             ; 
        jle     _cls_return          ; basically, return if width or height is 0
        test    r9d, r9d             ; ( window minimized )
        jle     _cls_return          ;
        mov     r8d, r9d             ; r8d = width (height is already saved in ebx)
        mov     esi, r9d             ; esi = width
        mov     edi, r9d             ; edi = width
        xor     r10d, r10d           ; r10d = 0
        shr     esi, 2               ; esi = width / 4
        movd    xmm1, r11d           ; move the lower 32-bits of the color into xmm1
        lea     r12d, [r9-1]         ; r12d = width - 1
        shl     rsi, 4               ; 003F _ 48: C1. E6, 04
        mov     ebp, r8d             ; 0043 _ 44: 89. C5
        shl     rdi, 2               ; 0046 _ 48: C1. E7, 02
        pshufd  xmm0, xmm1, 0        ; 004A _ 66: 0F 70. C1, 00
        shl     rbp, 2               ; 004F _ 48: C1. E5, 02

ALIGN   8
?_001:  cmp     r12d, 2                                
        jbe     ?_006                ; if (width - 1 <= 2) { ?_006 }
        mov     rax, rcx             ; 005E _ 48: 89. C8
        lea     rdx, [rcx+rsi]       ; 0061 _ 48: 8D. 14 31

ALIGN   8
?_002:  movups  oword [rax], xmm0    ; 0068 _ 0F 11. 00
        add     rax, 16              ; 006B _ 48: 83. C0, 10
        cmp     rdx, rax             ; 006F _ 48: 39. C2
        jnz     ?_002                ; 0072 _ 75, F4
        lea     rdx, [rcx+rbp]       ; 0074 _ 48: 8D. 14 29
        mov     eax, r8d             ; 0078 _ 44: 89. C0
        cmp     r9d, r8d             ; 007B _ 45: 39. C1
        jz      ?_004                ; 007E _ 74, 1C
?_003:  lea     r13d, [rax+1H]       ; 0080 _ 44: 8D. 68, 01
        mov     dword [rdx], r11d    ; 0084 _ 44: 89. 1A
        cmp     r13d, r9d            ; 0087 _ 45: 39. CD
        jge     ?_004                ; 008A _ 7D, 10
        add     eax, 2               ; 008C _ 83. C0, 02
        mov     dword [rdx+4H], r11d ; 008F _ 44: 89. 5A, 04
        cmp     r9d, eax             ; 0093 _ 41: 39. C1
        jle     ?_004                ; 0096 _ 7E, 04
        mov     dword [rdx+8H], r11d ; 0098 _ 44: 89. 5A, 08
?_004:  add     r10d, 1              ; 009C _ 41: 83. C2, 01
        add     rcx, rdi             ; 00A0 _ 48: 01. F9
        cmp     ebx, r10d            ; 00A3 _ 44: 39. D3
        jnz     ?_001                ; 00A6 _ 75, B0
_cls_return: 
        pop     rbx                  ;
        pop     rsi                  ;
        pop     rdi                  ;
        pop     rbp                  ;
        pop     r12                  ;
        pop     r13                  ; pop all the saved registers
        ret                          ; 

?_006:  ; Local function
        mov     rdx, rcx             ; 00B1 _ 48: 89. CA
        xor     eax, eax             ; 00B4 _ 31. C0
        jmp     ?_003                ; 00B6 _ EB, C8

Now, in ?_001, the compiler compares width - 1 to 2, which is the same thing as comparing the width to 3. My question is, with -O3, why did the compiler choose two instead of three, and waste a lea (to move width - 1 into r12d).
The only thing which makes sense to me is that powers of two are somehow faster to compare. Or maybe it's a compiler quirk?

深陷 2025-02-11 00:36:41

The usual reason for GCC tweaking compare constants is to create smaller immediates, which helps the constant fit in an immediate of whatever width the instruction offers. See the related questions "Understanding gcc output for if (a>=3)" and "GCC seems to prefer small immediate values in comparisons. Is there a way to avoid that?" (GCC always does this, instead of checking whether it's actually useful with this constant on the target ISA.) This heuristic works well for most ISAs, but sometimes not for AArch64 or ARM Thumb, which can encode some immediates as a bit-range / bit-pattern, so a smaller-magnitude number is not always better.

The width-1 is not part of that. The -1 is part of a range check to skip the auto-vectorized loop (16 bytes at a time with movups) and go straight to the cleanup, 1..3 scalar stores.

It seems to be checking width >= 1 && width <= 3, i.e. cleanup needed but total size is less than a full vector width. It's not equivalent to signed or unsigned width <= 3 for width=0. Note the unsigned compare: 0 - 1 is above 2U, because -1U is UINT_MAX.
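The equivalence described above can be checked directly. The following is a minimal sketch, not taken from GCC itself; the function names are hypothetical and exist only for illustration. It compares the subtract-then-unsigned-compare form against the two-sided check, and shows that a plain unsigned width <= 3 differs only at width = 0.

```cpp
#include <cassert>

// The form GCC emitted: subtract 1, then a single unsigned compare (jbe).
// The subtraction is done in unsigned arithmetic, so width == 0 wraps
// around to UINT_MAX and fails the test.
constexpr bool gcc_range_check(int width) {
    return static_cast<unsigned>(width) - 1u <= 2u;
}

// The two-sided range check that the single compare stands for.
constexpr bool two_sided_check(int width) {
    return width >= 1 && width <= 3;
}

// A plain unsigned compare against 3, with no subtraction.
constexpr bool plain_unsigned_check(int width) {
    return static_cast<unsigned>(width) <= 3u;
}

static_assert(!gcc_range_check(0), "0 - 1 wraps to UINT_MAX, which is above 2U");
static_assert(plain_unsigned_check(0), "without the subtraction, 0 passes");
static_assert(gcc_range_check(1) && gcc_range_check(3), "1..3 pass");
static_assert(!gcc_range_check(4), "4 is a full vector, so it fails");

// Hypothetical helper: true if both forms agree for every width in [lo, hi].
inline bool checks_agree(int lo, int hi) {
    assert(lo <= hi);
    for (int w = lo; w <= hi; ++w)
        if (gcc_range_check(w) != two_sided_check(w)) return false;
    return true;
}
```

Calling checks_agree(-1000, 1000) exercises negative widths as well: as unsigned values they are far above 2U, so both forms reject them.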

But it already excluded width <= 0 with test r9d, r9d / jle _cls_return, so it would have been better for GCC to just check width <= 3U instead of doing extra work to exclude zero from the range-check. (An lea, and save/restore of R12 which isn't otherwise used!)

(The cleanup also looks over-complicated, with some weird branching around for various cases; it could e.g. use movq [rdx], xmm0 if more than 1 uint is needed. And even better, if the total size is >= 4 uints, just do another movups that ends at the end of the range, possibly overlapping with previous stores.)
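The overlapping-store idea can be sketched in portable C++. This is an illustration under stated assumptions, not what GCC generates: fill_row and clear_screen_overlap are hypothetical names, the sketch works one row at a time rather than on the total size, and each std::memcpy of a 4-pixel block stands in for one movups (compilers commonly lower such a fixed-size copy to a single unaligned vector store, but that is not guaranteed).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: fill one row of `width` pixels with `color`.
inline void fill_row(unsigned int *row, int width, unsigned int color) {
    assert(width > 0);
    const unsigned int block[4] = {color, color, color, color};

    if (width < 4) {
        // Too short for a whole block: 1..3 scalar stores.
        for (int x = 0; x < width; ++x) row[x] = color;
        return;
    }

    // Main loop: whole 4-pixel blocks.
    int x = 0;
    for (; x + 4 <= width; x += 4)
        std::memcpy(row + x, block, sizeof block);

    // Cleanup: one more block that ends exactly at the end of the row.
    // It may overlap pixels already written, which is harmless because
    // every pixel receives the same value.
    if (x != width)
        std::memcpy(row + (width - 4), block, sizeof block);
}

// Hypothetical variant of the question's clear_screen built on fill_row.
inline void clear_screen_overlap(unsigned int color, void *memory,
                                 int height, int width) {
    if (height <= 0 || width <= 0) return;
    unsigned int *pixel = static_cast<unsigned int *>(memory);
    for (int y = 0; y < height; ++y)
        fill_row(pixel + static_cast<std::size_t>(y) * width, width, color);
}
```

The overlapping store replaces the chain of compare-and-branch scalar stores with at most one extra block store per row, at the cost of writing a few pixels twice.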

Yes, this is a missed optimization; you can report it on https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc now that you know that. (It's good that you asked here first instead of filing a bug without first figuring out whether the instruction could be avoided.)

"The only thing which makes sense to me is that powers of two are somehow faster to compare."

No, it's not faster; cmp performance is not data-dependent at all. (No integer instructions are, except sometimes [i]div. And on AMD CPUs before Zen3, pext / pdep. But anyway, not simple integer add/compare/shift stuff. See https://uops.info/).

And BTW, we can reproduce your GCC asm output on Godbolt by telling it this function is __attribute__((ms_abi)), or there's a command-line option to set the calling convention default. (It's really only useful for looking at the asm; it's still using GNU/Linux headers and x86-64 System V type widths like 64-bit long. Only a proper MinGW (cross-)compiler could show you what GCC would really do when targeting Windows.)
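As a sketch of the attribute approach, the question's function can be marked up as below. CLS_MS_ABI is a hypothetical macro name introduced here so the snippet still compiles on other targets, and check_clear_screen is a hypothetical sanity check that was not part of the question. The command-line alternative is assumed to be GCC's -mabi=ms; compile with -O3 as in the question to compare the output.

```cpp
#include <cassert>

// Assumption: GCC or Clang targeting x86-64. Elsewhere the macro expands
// to nothing, so the function just uses the platform's default convention.
#if defined(__GNUC__) && defined(__x86_64__)
#define CLS_MS_ABI __attribute__((ms_abi))
#else
#define CLS_MS_ABI
#endif

// Same body as in the question. With ms_abi the arguments arrive in
// ecx (color), rdx (memory), r8d (height) and r9d (width), matching
// the registers seen in the objconv listing.
CLS_MS_ABI
void clear_screen(unsigned int color, void *memory, int height, int width) {
    unsigned int *pixel = (unsigned int *)memory;
    for (auto y = 0; y < height; y++)
        for (auto x = 0; x < width; x++)
            *pixel++ = color;
}

// Hypothetical sanity check: clear a 2x3 buffer and verify every pixel.
inline void check_clear_screen() {
    unsigned int buf[6] = {0, 0, 0, 0, 0, 0};
    clear_screen(0x112233u, buf, 2, 3);
    for (unsigned int px : buf) assert(px == 0x112233u);
}
```

The attribute changes only the calling convention of this one function, which is enough to make the prologue and register usage line up with the listing in the question.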

The Godbolt output is GAS .intel_syntax noprefix, which is MASM-like, not NASM, but the difference would only be obvious with addressing modes involving global variables.
