How does the GCC implementation of modulo (%) work, and why does it not use the div instruction?
I was trying to work out how to calculate modulo 10 in assembly, so I compiled the following C code with GCC to see what it came up with.
unsigned int i=999;
unsigned int j=i%10;
To my surprise I got
movl -4(%ebp), %ecx
movl $-858993459, %edx
movl %ecx, %eax
mull %edx
shrl $3, %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
addl %eax, %eax
movl %ecx, %edx
subl %eax, %edx
movl %edx, %eax
movl %eax, -12(%ebp)
Here -4(%ebp), or "i", is the input and -12(%ebp), or "j", is the answer. I've tested this and it works no matter what number you put in -4(%ebp).
My question is: how does this code work, and how is it better than using the div instruction?
Comments (2)
Second question first: div is a very slow instruction (more than 20 clock cycles). The sequence above consists of more instructions, but they're all relatively fast, so it's a net win in terms of speed.
The first five instructions (up to and including the shrl) compute i/10 (I'll explain how in a minute).
The next few instructions multiply the result by 10 again, but avoid the mul/imul instructions (whether this is a win or not depends on the exact processor you're targeting: newer x86s have very fast multipliers, but older ones don't).
This is then subtracted from i again to obtain i - (i/10)*10, which is i % 10 (for unsigned numbers).
Finally, on the computation of i/10: the basic idea is to replace division by 10 with multiplication by 1/10. The compiler does a fixed-point approximation of this by multiplying by (2**35 / 10 + 1) and right-shifting the result by 35. That multiplier is the magic value loaded into edx; it is printed as a signed value (-858993459) even though it is really unsigned (3435973837, or 0xCCCCCCCD). The shift by 35 comes from keeping only the high 32 bits of the 64-bit product (a shift by 32) followed by the shrl $3. This turns out to give the right result for all 32-bit integers.
There are algorithms to determine this kind of approximation which guarantee that the error is less than 1 (which for integers means it's the right value), and GCC obviously uses one :)
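The fixed-point idea can be written out in C as a minimal sketch. The constant and the shift are the ones from the listing; the names MAGIC_DIV10, div10_magic and mod10_magic are made up for illustration and are not anything GCC itself exposes.

```c
#include <stdint.h>

/* 2^35 / 10 + 1 = 3435973837; the assembler listing prints it as -858993459. */
#define MAGIC_DIV10 0xCCCCCCCDu

/* i / 10 via multiplication by a fixed-point approximation of 1/10.
   (Hypothetical helper name, for illustration only.) */
static uint32_t div10_magic(uint32_t i) {
    uint64_t product = (uint64_t)i * MAGIC_DIV10;  /* full 64-bit product, as mull produces */
    return (uint32_t)(product >> 35);              /* high half (>> 32), then shrl $3 */
}

/* i % 10 computed as i - (i / 10) * 10. */
static uint32_t mod10_magic(uint32_t i) {
    return i - div10_magic(i) * 10u;
}
```

For the value from the question, div10_magic(999) gives 99 and mod10_magic(999) gives 9.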
Final remark: If you want to actually see GCC compute a modulo with a division instruction, make the divisor a variable (e.g. a function parameter) so it can't do this kind of optimization. Anyway, on x86, you compute modulo using div. div expects the 64-bit dividend in edx:eax (high 32 bits in edx, low 32 bits in eax; clear edx to zero if you're working with a 32-bit number) and divides that by whatever operand you specify (e.g. div ebx divides edx:eax by ebx). It returns the quotient in eax and the remainder in edx. idiv does the same for signed values (there you sign-extend eax into edx, e.g. with cdq, instead of clearing it).
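As a rough illustration of what div does with its operands, here is a small C model. It is only a sketch: div_result, model_div and mod_variable are hypothetical names, and the model assumes a non-zero divisor and a quotient that fits in 32 bits (the real instruction raises a divide error otherwise).

```c
#include <stdint.h>

/* Hypothetical model of the 32-bit x86 div instruction. */
typedef struct {
    uint32_t quotient;   /* what div leaves in eax */
    uint32_t remainder;  /* what div leaves in edx */
} div_result;

/* Divides the 64-bit value edx:eax by divisor.
   Assumes divisor != 0 and that the quotient fits in 32 bits. */
static div_result model_div(uint32_t edx, uint32_t eax, uint32_t divisor) {
    uint64_t dividend = ((uint64_t)edx << 32) | eax;  /* edx is the high half */
    div_result r;
    r.quotient = (uint32_t)(dividend / divisor);
    r.remainder = (uint32_t)(dividend % divisor);
    return r;
}

/* With a run-time divisor the compiler cannot pick a magic constant in
   advance, so on x86 it would typically emit a div for this function. */
static uint32_t mod_variable(uint32_t i, uint32_t divisor) {
    return i % divisor;
}
```

With edx cleared to zero, model_div(0, 999, 10) yields quotient 99 and remainder 9, matching mod_variable(999, 10).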
The first part, up to shrl $3, %edx, implements a fast integer division by 10. There are a few different algorithms that work when the number by which you divide is known in advance. Note that 858993459 is about 0.2 * 2^32; read as an unsigned 32-bit value, the constant -858993459 is 3435973837, about 0.8 * 2^32, so taking the high 32 bits of the product and then shifting right by 3 multiplies by roughly 0.8 / 8 = 0.1. The reason to do this is that, even though there is an integer division instruction div/idiv in the instruction set, it's typically very slow, several times slower than multiplication.
The second part calculates the remainder by multiplying the result of the division by 10 (in an indirect way, via shifts and adds; presumably the compiler thinks that it will be faster that way) and then subtracting that from the original number.
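To make the two parts easier to follow, the listing from the question can be mirrored one instruction at a time in C. This is a sketch for reading along, not compiler output; the function name mod10_like_gcc is invented here and the local variables are simply named after the registers they stand for.

```c
#include <stdint.h>

/* Hypothetical step-by-step mirror of the assembly in the question. */
static uint32_t mod10_like_gcc(uint32_t i) {
    uint32_t ecx = i;                                /* movl -4(%ebp), %ecx */
    uint64_t product = (uint64_t)ecx * 0xCCCCCCCDu;  /* mull %edx: result in edx:eax */
    uint32_t edx = (uint32_t)(product >> 32);        /* only the high half is used */
    edx >>= 3;                                       /* shrl $3, %edx   -> i / 10 */
    uint32_t eax = edx;                              /* movl %edx, %eax */
    eax <<= 2;                                       /* sall $2, %eax   -> 4 * (i / 10) */
    eax += edx;                                      /* addl %edx, %eax -> 5 * (i / 10) */
    eax += eax;                                      /* addl %eax, %eax -> 10 * (i / 10) */
    edx = ecx;                                       /* movl %ecx, %edx */
    edx -= eax;                                      /* subl %eax, %edx -> i - 10 * (i / 10) */
    return edx;                                      /* the value stored in j */
}
```

The first four statements are the division by 10; the shift-and-add sequence in the middle is the multiplication by 10 done without mul/imul.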