Why does GCC subtract 1 and compare <= 2? Is cmp faster with powers of two in assembly?
I was writing some code to clear the screen to a particular color. C++ code:
void clear_screen(unsigned int color, void *memory, int height, int width) {
    unsigned int *pixel = (unsigned int *)memory;
    for (auto y = 0; y < height; y++)
        for (auto x = 0; x < width; x++)
            *pixel++ = color;
}
I used g++ and objconv to generate the corresponding assembly. This is what I got, and I've commented what I think some of the lines do too.
renderer_clear_screen:
push r13
push r12
push rbp
push rdi
push rsi
push rbx
mov r11d, ecx ; move the color into r11d
mov ebx, r8d ; move the height into ebx
mov rcx, rdx ; 000E _ 48: 89. D1
test r8d, r8d ;
jle _cls_return ; basically, return if width or height is 0
test r9d, r9d ; ( window minimized )
jle _cls_return ;
mov r8d, r9d ; r8d = width (height was already copied to ebx)
mov esi, r9d ; esi = width
mov edi, r9d ; edi = width
xor r10d, r10d ; r10d = 0
shr esi, 2 ; esi = width / 4
movd xmm1, r11d ; move the lower 32-bits of the color into xmm1
lea r12d, [r9-1] ; r12d = width - 1
shl rsi, 4 ; 003F _ 48: C1. E6, 04
mov ebp, r8d ; 0043 _ 44: 89. C5
shl rdi, 2 ; 0046 _ 48: C1. E7, 02
pshufd xmm0, xmm1, 0 ; 004A _ 66: 0F 70. C1, 00
shl rbp, 2 ; 004F _ 48: C1. E5, 02
ALIGN 8
?_001: cmp r12d, 2
jbe ?_006 ; if (width - 1 <= 2) { ?_006 }
mov rax, rcx ; 005E _ 48: 89. C8
lea rdx, [rcx+rsi] ; 0061 _ 48: 8D. 14 31
ALIGN 8
?_002: movups oword [rax], xmm0 ; 0068 _ 0F 11. 00
add rax, 16 ; 006B _ 48: 83. C0, 10
cmp rdx, rax ; 006F _ 48: 39. C2
jnz ?_002 ; 0072 _ 75, F4
lea rdx, [rcx+rbp] ; 0074 _ 48: 8D. 14 29
mov eax, r8d ; 0078 _ 44: 89. C0
cmp r9d, r8d ; 007B _ 45: 39. C1
jz ?_004 ; 007E _ 74, 1C
?_003: lea r13d, [rax+1H] ; 0080 _ 44: 8D. 68, 01
mov dword [rdx], r11d ; 0084 _ 44: 89. 1A
cmp r13d, r9d ; 0087 _ 45: 39. CD
jge ?_004 ; 008A _ 7D, 10
add eax, 2 ; 008C _ 83. C0, 02
mov dword [rdx+4H], r11d ; 008F _ 44: 89. 5A, 04
cmp r9d, eax ; 0093 _ 41: 39. C1
jle ?_004 ; 0096 _ 7E, 04
mov dword [rdx+8H], r11d ; 0098 _ 44: 89. 5A, 08
?_004: add r10d, 1 ; 009C _ 41: 83. C2, 01
add rcx, rdi ; 00A0 _ 48: 01. F9
cmp ebx, r10d ; 00A3 _ 44: 39. D3
jnz ?_001 ; 00A6 _ 75, B0
_cls_return:
pop rbx ;
pop rsi ;
pop rdi ;
pop rbp ;
pop r12 ;
pop r13 ; pop all the saved registers
ret ;
?_006: ; Local function
mov rdx, rcx ; 00B1 _ 48: 89. CA
xor eax, eax ; 00B4 _ 31. C0
jmp ?_003 ; 00B6 _ EB, C8
Now, in ?_001, the compiler compares width - 1 to 2, which is the same thing as comparing the width to 3. My question is, with -O3, why did the compiler choose two instead of three, and waste a lea (to move width - 1 into r12d)?
The only thing which makes sense to me is that powers of two are somehow faster to compare. Or maybe it's a compiler quirk?
The usual reason for GCC tweaking compare constants is to create smaller immediates, which helps the constant fit in an immediate of whatever width the instruction provides. See "Understanding gcc output for if (a>=3)" and "GCC seems to prefer small immediate values in comparisons. Is there a way to avoid that?" (GCC always does this, instead of checking whether it's actually useful with this constant on the target ISA.) This heuristic works well for most ISAs, but sometimes not for AArch64 or ARM Thumb, which can encode some immediates as a bit-range / bit-pattern, so a smaller-magnitude number is not always better.
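As a small illustration of the kind of constant tweak described above (a sketch, not GCC's actual implementation): for integers, a >= 3 and a > 2 are the same predicate, so a compiler is free to emit whichever constant it prefers. The function names at_least_3 and above_2 are made up for this example.

```cpp
#include <climits>

// Two spellings of the same predicate on int. A compiler may
// canonicalize one form into the other, which changes the
// immediate that ends up in the cmp instruction.
constexpr bool at_least_3(int a) { return a >= 3; }
constexpr bool above_2(int a)    { return a > 2; }

// Compile-time spot checks at the boundary and at the extremes.
static_assert(at_least_3(2) == above_2(2), "both are false at 2");
static_assert(at_least_3(3) == above_2(3), "both are true at 3");
static_assert(at_least_3(INT_MIN) == above_2(INT_MIN), "agree at INT_MIN");
static_assert(at_least_3(INT_MAX) == above_2(INT_MAX), "agree at INT_MAX");
```

Compiling non-constexpr versions of both functions with optimization enabled and comparing the asm (for example on Godbolt) is one way to see which immediate the compiler picked.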
The width-1 is not part of that. The -1 is part of a range check to skip the auto-vectorized loop (16 bytes at a time with movups) and go straight to the cleanup, 1..3 scalar stores.
It seems to be checking width >= 1 && width <= 3, i.e. cleanup needed but total size is less than a full vector width. It's not equivalent to signed or unsigned width <= 3 for width=0. Note the unsigned compare: 0 - 1 is above 2U, because -1U is UINT_MAX.
But it already excluded width <= 0 with test r9d, r9d / jle _cls_return, so it would have been better for GCC to just check width <= 3U instead of doing extra work to exclude zero from the range-check. (An lea, and save/restore of R12 which isn't otherwise used!)
(The cleanup also looks over-complicated, with some weird branching around for various cases; it could, e.g., use movq [rdx], xmm0 if more than 1 uint is needed. And even better, if the total size is >= 4 uints, just do another movups that ends at the end of the range, possibly overlapping with previous stores.)
Yes, this is a missed optimization; you can report it on https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (now that you know it's a missed optimization; it's good that you asked here first instead of filing a bug without first figuring out if the instruction could be avoided.)
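The range-check reasoning above can be checked in plain C++. This is a sketch with made-up function names (gcc_check, range_1_to_3, simple_check); it models the lea / cmp / jbe sequence as unsigned arithmetic rather than reproducing GCC's internals.

```cpp
#include <climits>

// What the emitted code tests: lea r12d,[r9-1] ; cmp r12d,2 ; jbe
// The subtraction is done in unsigned so that width == 0 wraps to UINT_MAX.
constexpr bool gcc_check(int width) {
    return static_cast<unsigned>(width) - 1u <= 2u;
}

// The two-sided range the single unsigned compare stands for.
constexpr bool range_1_to_3(int width) {
    return width >= 1 && width <= 3;
}

// The cheaper check that would suffice once width <= 0 is already excluded.
constexpr bool simple_check(int width) {
    return static_cast<unsigned>(width) <= 3u;
}

static_assert(gcc_check(1) && gcc_check(3), "1..3 go to the scalar cleanup");
static_assert(!gcc_check(4), "4 or more takes the vector loop");
static_assert(!gcc_check(0), "0 - 1 wraps to UINT_MAX, which is above 2");
static_assert(simple_check(0), "the simpler check differs only at width == 0");
static_assert(gcc_check(INT_MAX) == range_1_to_3(INT_MAX), "agree at INT_MAX");
```

For every width >= 1, gcc_check and simple_check agree, which is why the extra lea and the saved/restored register buy nothing after the earlier test / jle.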
No, it's not faster; cmp performance is not data-dependent at all. (No integer instructions are, except sometimes div and idiv, and on AMD CPUs before Zen3, pext / pdep. But anyway, not simple integer add/compare/shift stuff. See https://uops.info/.)
And BTW, we can reproduce your GCC asm output on Godbolt by telling it this function is __attribute__((ms_abi)), or there's a command-line option to set the calling convention default. (It's really only useful for looking at the asm; it's still using GNU/Linux headers and x86-64 System V type widths like 64-bit long. Only a proper MinGW (cross-)compiler could show you what GCC would really do when targeting Windows.)
It's GAS .intel_syntax noprefix, which is MASM-like, not NASM, but the difference would only be obvious with addressing modes involving global variables.
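A minimal sketch of the Godbolt reproduction mentioned above, assuming an x86-64 GCC: the question's function with the ms_abi attribute applied, so the arguments arrive in ecx, rdx, r8d and r9d as in the posted listing. The command-line option alluded to is presumably GCC's -mabi=ms; that is an assumption on my part, not something stated in the answer.

```cpp
// Same function as in the question, but using the Windows x64 calling
// convention even when compiled by a Linux-targeting GCC.
// ms_abi is an x86-specific GCC attribute; other targets ignore it.
__attribute__((ms_abi))
void clear_screen(unsigned int color, void *memory, int height, int width) {
    unsigned int *pixel = static_cast<unsigned int *>(memory);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            *pixel++ = color;  // fill the buffer row by row
}
```

Compiling this with -O3 and Intel-syntax output should give asm comparable to the listing in the question, although the exact instructions depend on the GCC version.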