How does GCC's implementation of modulo (%) work, and why doesn't it use the div instruction?

Posted on 2024-10-06 07:59:40

I was trying to work out how to calculate modulo 10 in assembly, so I compiled the following C code with GCC to see what it came up with.

unsigned int i=999;
unsigned int j=i%10;

To my surprise I got

movl    -4(%ebp), %ecx
movl    $-858993459, %edx
movl    %ecx, %eax
mull    %edx
shrl    $3, %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
addl    %eax, %eax
movl    %ecx, %edx
subl    %eax, %edx
movl    %edx, %eax
movl    %eax, -12(%ebp)

Here -4(%ebp) holds the input "i" and -12(%ebp) holds the answer "j". I've tested this, and it works no matter what value is stored at -4(%ebp).

My question is: how does this code work, and how is it better than using the div instruction?

Comments (2)

凡间太子 2024-10-13 07:59:40

Second question first: div is a very slow instruction (more than 20 clock cycles). The sequence above consists of more instructions, but they're all relatively fast, so it's a net win in terms of speed.

The first five instructions (up to and including the shrl) compute i/10 (I'll explain how in a minute).

The next few instructions multiply the result by 10 again, but avoiding the mul/imul instructions (whether this is a win or not depends on the exact processor you're targeting - newer x86s have very fast multipliers, but older ones don't).

movl    %edx, %eax   # eax=i/10
sall    $2, %eax     # eax=(i/10)*4
addl    %edx, %eax   # eax=(i/10)*4 + (i/10) = (i/10)*5
addl    %eax, %eax   # eax=(i/10)*5*2 = (i/10)*10

This is then subtracted from i again to obtain i - (i/10)*10 which is i % 10 (for unsigned numbers).
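
The shift-and-add multiplication and the final subtraction can be written out in C as a minimal sketch. The function names `times10`, `mod10_from_quotient`, and `check_mod10` are made up for illustration and do not come from GCC or the original post; the quotient is passed in so that the sketch covers only this step.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 10 with one shift and two adds, mirroring the
   movl / sall $2 / addl / addl sequence. */
static uint32_t times10(uint32_t q)
{
    uint32_t t = q << 2;  /* q * 4 */
    t += q;               /* q * 4 + q = q * 5 */
    t += t;               /* q * 5 * 2 = q * 10 */
    return t;
}

/* Remainder from an already computed quotient q = i / 10:
   i - (i / 10) * 10, which is i % 10 for unsigned values. */
static uint32_t mod10_from_quotient(uint32_t i, uint32_t q)
{
    return i - times10(q);
}

/* Hypothetical self-check, not part of the original answer. */
static void check_mod10(void)
{
    assert(times10(99) == 990);
    assert(mod10_from_quotient(999, 999 / 10) == 999 % 10);
}
```

Calling `check_mod10()` from a `main` function exercises both helpers with the value 999 from the question.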

Finally, on the computation of i/10: the basic idea is to replace division by 10 with multiplication by 1/10. The compiler does a fixed-point approximation of this by multiplying by (2**35 / 10 + 1), which is 3435973837, or 0xCCCCCCCD. That's the magic value loaded into edx; the assembly listing prints it as the signed value -858993459 even though it's really unsigned. The 64-bit product is then right-shifted by 35: mull leaves the high 32 bits in edx, which accounts for a shift of 32, and shrl $3 supplies the remaining 3. This turns out to give the right result for all 32-bit unsigned integers.

There are algorithms to determine this kind of approximation which guarantee that the error is less than 1 (which for integers means it's the right value), and GCC obviously uses one :)
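
As a rough C sketch of the multiply-and-shift division described above: the names `MAGIC_DIV10`, `div10`, and `check_div10` are assumptions chosen for illustration, and the sketch only mirrors what the listed instructions do rather than reproducing GCC's internal algorithm for picking the constant.

```c
#include <assert.h>
#include <stdint.h>

/* 0xCCCCCCCD == 3435973837 == 2^35 / 10 + 1 (integer division).
   Printed as a signed 32-bit value it is -858993459. */
#define MAGIC_DIV10 UINT32_C(0xCCCCCCCD)

/* Unsigned division by 10 via a 32x32 -> 64-bit multiply and a shift. */
static uint32_t div10(uint32_t i)
{
    uint64_t product = (uint64_t)i * MAGIC_DIV10; /* mull: result in edx:eax */
    return (uint32_t)(product >> 35);             /* edx is >> 32, shrl $3 adds 3 */
}

/* Hypothetical self-check: compare against the real / operator. */
static void check_div10(void)
{
    for (uint32_t i = 0; i < 100000; ++i) {
        assert(div10(i) == i / 10);
    }
    assert(div10(UINT32_MAX) == UINT32_MAX / 10);
}
```

The check covers only a sample of inputs plus the largest 32-bit value; a loop over all 2^32 inputs would confirm the claim exhaustively and is still feasible on current hardware.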

Final remark: If you want to actually see GCC compute a modulo, make the divisor variable (e.g. a function parameter) so it can't do this kind of optimization. Anyway, on x86, you compute modulo using div. div expects the 64-bit dividend in edx:eax (high 32 bits in edx, low 32 bits in eax - clear edx to zero if you're working with a 32-bit number) and divides that by whatever operand you specify (e.g. div ebx divides edx:eax by ebx). It returns the quotient in eax and the remainder in edx. idiv does the same for signed values.
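
A minimal sketch of the variable-divisor case mentioned above. The names `mod_runtime` and `check_mod_runtime` are hypothetical; the idea is to compile the file with `gcc -S` and look at the assembly for `mod_runtime`. Because the divisor is not known at compile time, the constant-multiplication trick cannot be applied to it, though the exact output depends on the GCC version and flags.

```c
#include <assert.h>

/* The divisor is a run-time parameter, so the compiler has no constant
   to turn into a multiply-and-shift sequence. */
unsigned int mod_runtime(unsigned int i, unsigned int divisor)
{
    return i % divisor;
}

/* Hypothetical self-check using the value from the question. */
static void check_mod_runtime(void)
{
    assert(mod_runtime(999, 10) == 9);
}
```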

清风夜微凉 2024-10-13 07:59:40

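The first part, up to shrl $3, %edx, implements a fast integer division by 10. There are a few different algorithms that work when the number by which you divide is known in advance. Note that the constant -858993459, read as an unsigned 32-bit value, is 3435973837 (0xCCCCCCCD), which is approximately "0.8 * 2^32". Taking the high half of the 64-bit product therefore multiplies by about 0.8, and the shift right by 3 divides by 8, giving i/10. The reason to do this is that, even though there is an integer division instruction div/idiv in the instruction set, it's typically very slow, several times slower than multiplication.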

The second part calculates the remainder by multiplying the result of division by 10 (in an indirect way, via shifts and adds; presumably the compiler thinks that it will be faster that way) and then subtracting that from the original number.
