使用位移位除以 10?
是否可以通过使用纯位移位、加法、减法和乘法来将无符号整数除以 10?使用资源非常有限且除法缓慢的处理器。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
是否可以通过使用纯位移位、加法、减法和乘法来将无符号整数除以 10?使用资源非常有限且除法缓慢的处理器。
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
接受
或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
发布评论
评论(11)
编者注:这实际上不是编译器所做的,并且对于以 9 结尾的大正整数给出了错误的答案,以
div10(1073741829) = 107374183
而不是开头107374182
(Godbolt)。不过,对于小于0x40000005
的输入来说,它是准确的,这对于某些用途来说可能就足够了。编译器(包括 MSVC)确实对常数除数使用定点乘法逆元,但它们使用不同的魔术常数并对高半结果进行移位,以获得所有可能输入的精确结果,这与 C 抽象机的要求相匹配。请参阅Granlund 和蒙哥马利关于该算法的论文。
请参阅 为什么 GCC 使用乘法在实现整数除法时被一个奇怪的数字? 获取实际 x86 asm gcc、clang、MSVC、ICC 和其他现代编译器的示例。
这是一种快速近似,对于大输入来说并不精确
。它甚至比编译器使用的通过乘法+右移进行的精确除法还要快。
您可以使用乘法结果的高半部分除以小积分常量。假设一台 32 位机器(代码可以相应调整):
这里我们乘以 1/10 * 2^32 的近似值,然后删除 2^32。这种方法可以适应不同的除数和不同的位宽度。
这对于 ia32 架构非常有用,因为它的 IMUL 指令会将 64 位乘积放入 edx:eax 中,而 edx 值将是所需的值。即(假设被除数在 ecx(快速调用)中传递,商在 eax 中返回)
即使在具有慢乘法指令的机器上,这也会比软件甚至硬件除法更快。
Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with
div10(1073741829) = 107374183
not107374182
(Godbolt). It is exact for inputs smaller than0x40000005
, though, which may be sufficient for some uses.Compilers (including MSVC) do use fixed-point multiplicative inverses for constant divisors, but they use a different magic constant and shift on the high-half result to get an exact result for all possible inputs, matching what the C abstract machine requires. See Granlund & Montgomery's paper on the algorithm.
See Why does GCC use multiplication by a strange number in implementing integer division? for examples of the actual x86 asm gcc, clang, MSVC, ICC, and other modern compilers make.
This is a fast approximation that's inexact for large inputs
It's even faster than the exact division via multiply + right-shift that compilers use.
You can use the high half of a multiply result for divisions by small integral constants. Assume a 32-bit machine (code can be adjusted accordingly):
What's going here is that we're multiplying by a close approximation of 1/10 * 2^32 and then removing the 2^32. This approach can be adapted to different divisors and different bit widths.
This works great for the ia32 architecture, since its IMUL instruction will put the 64-bit product into edx:eax, and the edx value will be the wanted value. Viz (assuming dividend is passed in ecx (fastcall) and quotient returned in eax)
Even on a machine with a slow multiply instruction, this will be faster than a software or even hardware divide.
虽然到目前为止给出的答案与实际问题相符,但与标题不符。因此,这是一个深受 黑客之乐 实际上只使用位移位。
我认为对于缺乏乘法指令的架构来说这是最好的解决方案。
Though the answers given so far match the actual question, they do not match the title. So here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts.
I think that this is the best solution for architectures that lack a multiply instruction.
当然可以,如果您可以忍受精度上的一些损失。如果您知道输入值的值范围,您可以提出精确的位移和乘法。
一些如何除以 10、60 等的示例,就像本博客中描述的格式 尽可能以最快的方式计时。
Of course you can if you can live with some loss in precision. If you know the value range of your input values you can come up with a bit shift and a multiplication which is exact.
Some examples how you can divide by 10, 60, ... like it is described in this blog to format time the fastest way possible.
为了稍微扩展阿洛伊斯的答案,我们可以扩展建议的 y = (x * 205) >>> 11 对于更多的倍数/移位:
每一行都是一个独立的计算,您将在注释中显示的值处看到第一个“错误”/不正确的结果。通常最好对给定的误差值采用最小的移位,因为这将最大限度地减少计算中存储中间值所需的额外位,例如
(x * 13) >>> 7
比(x * 52) >> “更好” 9
因为它需要少两位开销,而在 68 以上都开始给出错误答案。如果你想计算更多这些,可以使用以下(Python)代码:
我做了明显的事情计算此近似值何时开始出错:(
请注意
//
用于“整数”除法,即它向零截断/舍入)错误中“3/1”模式的原因(即 8 重复 3 次,然后是 9)似乎是由于碱基的变化,即
log2(10)
约为 3.32。如果我们绘制错误,我们会得到以下结果:其中相对误差由以下公式给出:
mul_from_shift(shift) / (1<
to expand Alois's answer a bit, we can expand the suggested
y = (x * 205) >> 11
for a few more multiples/shifts:each line is a single, independent, calculation, and you'll see your first "error"/incorrect result at the value shown in the comment. you're generally better off taking the smallest shift for a given error value as this will minimise the extra bits needed to store the intermediate value in the calculation, e.g.
(x * 13) >> 7
is "better" than(x * 52) >> 9
as it needs two less bits of overhead, while both start to give wrong answers above 68.if you want to calculate more of these, the following (Python) code can be used:
and I did the obvious thing for calculating when this approximation starts to go wrong with:
(note that
//
is used for "integer" division, i.e. it truncates/rounds towards zero)the reason for the "3/1" pattern in errors (i.e. 8 repeats 3 times followed by 9) seems to be due to the change in bases, i.e.
log2(10)
is ~3.32. if we plot the errors we get the following:where the relative error is given by:
mul_from_shift(shift) / (1<<shift) - 0.1
考虑到 Kuba Ober 的回应,还有另一个同样的回应。
它使用结果的迭代近似,但我不期望任何令人惊讶的性能。
假设我们必须找到
x
,其中x = v / 10
。我们将使用逆运算
v = x * 10
,因为它具有一个很好的属性,即当x = a + b
时,则x * 10 = a * 10 + b * 10。
让我们使用 x 作为变量来保存迄今为止结果的最佳近似值。当搜索结束时,
x
将保存结果。我们将x
的每一位b
从最高有效位到最低有效位依次设置,最后比较(x + b) * 10
(x + b) * 10(x + b) * 10(x + b) * 10代码> 与v
。如果它小于或等于v
,则在x
中设置位b
。为了测试下一位,我们只需将 b 向右移动一个位置(除以二)。我们可以通过将
x * 10
和b * 10
保存在其他变量中来避免乘以 10。这产生了以下将
v
除以 10 的算法。编辑:为了获得 Kuba Ober 的算法,它避免了变量
x10
的需要,我们可以从v
和v10
中减去b10
。在这种情况下,不再需要x10
。该算法变为可以展开循环,并且可以将
b
和b10
的不同值预先计算为常量。Considering Kuba Ober’s response, there is another one in the same vein.
It uses iterative approximation of the result, but I wouldn’t expect any surprising performances.
Let say we have to find
x
wherex = v / 10
.We’ll use the inverse operation
v = x * 10
because it has the nice property that whenx = a + b
, thenx * 10 = a * 10 + b * 10
.Let use
x
as variable holding the best approximation of result so far. When the search ends,x
Will hold the result. We’ll set each bitb
ofx
from the most significant to the less significant, one by one, end compare(x + b) * 10
withv
. If its smaller or equal tov
, then the bitb
is set inx
. To test the next bit, we simply shift b one position to the right (divide by two).We can avoid the multiplication by 10 by holding
x * 10
andb * 10
in other variables.This yields the following algorithm to divide
v
by 10.Edit: to get the algorithm of Kuba Ober which avoids the need of variable
x10
, we can subtractb10
fromv
andv10
instead. In this casex10
isn’t needed anymore. The algorithm becomesThe loop may be unrolled and the different values of
b
andb10
may be precomputed as constants.在一次只能移动一个位置的架构上,对 2 乘以 10 的递减幂进行一系列显式比较可能比黑客满意的解决方案效果更好。假设 16 位被除数:
On architectures that can only shift one place at a time, a series of explicit comparisons against decreasing powers of two multiplied by 10 might work better than the solution form hacker's delight. Assuming a 16 bit dividend:
那么除法就是减法,所以是的。右移 1(除以 2)。现在从结果中减去 5,计算减法的次数,直到该值小于 5。结果就是您进行的减法次数。哦,划分可能会更快。
如果除法器中的逻辑尚未为您执行此操作,则先右移然后使用正常除法除以 5 的混合策略可能会提高性能。
Well division is subtraction, so yes. Shift right by 1 (divide by 2). Now subtract 5 from the result, counting the number of times you do the subtraction until the value is less than 5. The result is number of subtractions you did. Oh, and dividing is probably going to be faster.
A hybrid strategy of shift right then divide by 5 using the normal division might get you a performance improvement if the logic in the divider doesn't already do this for you.
基于实时的答案,这里有一个基于Python的方法,支持无限精度:
输出:
注意:这并不快,只是使用会破坏此处大多数答案(包括引用的答案)的值。
编辑:基于John Källén的答案的替代方法,它比以前的代码更小并且可能更快:
输出与之前的代码。
为了稍微解释一下代码,它的工作原理是在
l
范围内复制invDivisor
的模式(l = 8 << power
基于除数位长度)。基本上,如果
n
是 44 位,那么 l 就是 64 位,其中
invDivisor
计算为( 0x3333333333333333 >> 1 ) ^ 3
,结果为0x199999999999999A
。它在任何位深度下都表现得非常好。
Based on realtime's answer, here's a python-based approach that supports infinite precision:
output:
NOTE: This is not exactly fast, it just works with values that would break most answers here, including the referenced answer.
EDIT: Alternative approach based on John Källén's answer which is smaller and potentially faster than the previous code:
The output is the same as the previous code.
To explain the code a bit, it works via replicating the pattern of
invDivisor
within the range ofl
(l = 8 << power
based on divisor bit length).Basically if
n
is 44 bits then l is 64 bits,where
invDivisor
is calculated as( 0x3333333333333333 >> 1 ) ^ 3
which results in0x199999999999999A
.It works surprisingly well at any bit depth.
对于无符号字节 - 小于 256 或 28 的数字:
对于小于 1029 (210+5) 的无符号数字(改编自 @AloisKraus 的答案):
对于小于65536 (216):
对于小于 4294967296 (232) 的数字:
对于所有 64 位数字(需要 GCC 扩展):
For unsigned bytes—numbers less than 256 or 28:
For unsigned numbers less than 1029 (210+5) (adapted from @AloisKraus's answer):
For numbers less than 65536 (216):
For numbers less than 4294967296 (232):
For all 64-bit numbers (requires GCC extensions):
我在AVR汇编中设计了一种新方法,仅使用lsr/ror和sub/sbc。它除以 8,然后减去除以 64 和 128 的数,然后减去第 1,024 个和第 2,048 个,依此类推。工作非常可靠(包括精确舍入)且快速(1 MHz 时为 370 微秒)。
16位数字的源代码在这里:
http://www.avr-asm-tutorial.net/ avr_cn/beginner/DIV10/div10_16rd.asm
注释此源代码的页面在这里:
http://www.avr-asm-tutorial.net/ avr_cn/beginner/DIV10/DIV10.html
我希望它能有所帮助,尽管这个问题已经有十年了。
BRGS、GSC
I've designed a new method in AVR assembly, with lsr/ror and sub/sbc only. It divides by 8, then sutracts the number divided by 64 and 128, then subtracts the 1,024th and the 2,048th, and so on and so on. Works very reliable (includes exact rounding) and quick (370 microseconds at 1 MHz).
The source code is here for 16-bit-numbers:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/div10_16rd.asm
The page that comments this source code is here:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/DIV10.html
I hope that it helps, even though the question is ten years old.
brgs, gsc
elemakil 的评论代码可以在这里找到: https://doc.lagout.org/security /Hackers%20Delight.pdf
第 233 页。“无符号除以 10 [和 11。]”
elemakil's comments' code can be found here: https://doc.lagout.org/security/Hackers%20Delight.pdf
page 233. "Unsigned divide by 10 [and 11.]"