Divide by 10 using bit shifts?

帅哥哥的热头脑 2024-11-06 18:20:39

Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182 (Godbolt). It is exact for inputs smaller than 0x40000005, though, which may be sufficient for some uses.

Compilers (including MSVC) do use fixed-point multiplicative inverses for constant divisors, but they use a different magic constant and shift on the high-half result to get an exact result for all possible inputs, matching what the C abstract machine requires. See Granlund & Montgomery's paper on the algorithm.

See "Why does GCC use multiplication by a strange number in implementing integer division?" for examples of the actual x86 asm that gcc, clang, MSVC, ICC, and other modern compilers produce.

This is a fast approximation that's inexact for large inputs

It's even faster than the exact division via multiply + right-shift that compilers use.

You can use the high half of a multiply result for divisions by small integral constants. Assume a 32-bit machine (code can be adjusted accordingly):

int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}

What's going on here is that we're multiplying by a close approximation of 1/10 * 2^32 and then removing the 2^32. This approach can be adapted to different divisors and different bit widths.

This works great for the ia32 architecture, since its IMUL instruction will put the 64-bit product into edx:eax, and the edx value will be the wanted value. Viz (assuming dividend is passed in ecx (fastcall) and quotient returned in eax)

div10 proc 
    mov    eax,1999999Ah    ; 1/10 * 2^32
    imul   ecx              ; edx:eax = dividend / 10 * 2 ^32
    mov    eax,edx          ; eax = dividend / 10
    ret
endp

Even on a machine with a slow multiply instruction, this will be faster than a software or even hardware divide.
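
For comparison, here is a minimal sketch that puts the approximation next to an exact form for unsigned 32-bit inputs. The constant 0xCCCCCCCD is ceil(2^35 / 10), and the shift of 35 is the pairing typically seen in compiler output for unsigned division by 10. The function names and the spot checks are illustrative and not part of the original answer.

```c
#include <assert.h>
#include <stdint.h>

/* Approximation from the answer above, taken here on unsigned input. */
uint32_t div10_approx(uint32_t n)
{
    return (uint32_t)(((uint64_t)0x1999999Au * n) >> 32);
}

/* Exact for every uint32_t: multiply by ceil(2^35 / 10), shift right by 35. */
uint32_t div10_exact(uint32_t n)
{
    return (uint32_t)(((uint64_t)0xCCCCCCCDu * n) >> 35);
}

/* Illustrative spot checks around the first failing input, 0x40000005. */
void div10_compare(void)
{
    assert(div10_approx(1073741819u) == 107374181u); /* still correct */
    assert(div10_approx(1073741829u) == 107374183u); /* one too high */
    assert(div10_exact(1073741829u) == 107374182u);
    assert(div10_exact(UINT32_MAX) == UINT32_MAX / 10u);
}
```

The extra three bits of shift are what buy the exactness: the rounding error in the constant is scaled down far enough that it can no longer push a value ending in 9 over to the next quotient.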

别忘他 2024-11-06 18:20:39

Though the answers given so far match the actual question, they do not match the title. So here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts.

unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}

I think that this is the best solution for architectures that lack a multiply instruction.
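
A typical use on such architectures is converting a number to decimal digits. The following is a sketch only, assuming unsigned is 32 bits wide; utoa10 is a hypothetical helper name, and divu10 is repeated so the example is self-contained.

```c
#include <assert.h>
#include <string.h>

/* divu10 as given above; assumes unsigned is 32 bits wide. */
unsigned divu10(unsigned n)
{
    unsigned q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}

/* Hypothetical helper: writes n in decimal into buf (at least 11 bytes)
   using only shifts, adds and subtracts. */
char *utoa10(unsigned n, char *buf)
{
    char tmp[10];
    int len = 0, i = 0;
    do {
        unsigned q = divu10(n);
        unsigned r = n - (((q << 2) + q) << 1);  /* n - q * 10 */
        tmp[len++] = (char)('0' + r);
        n = q;
    } while (n != 0);
    /* Digits come out least significant first, so reverse them. */
    while (len > 0)
        buf[i++] = tmp[--len];
    buf[i] = '\0';
    return buf;
}

/* Illustrative spot checks. */
void utoa10_check(void)
{
    char buf[11];
    assert(strcmp(utoa10(0u, buf), "0") == 0);
    assert(strcmp(utoa10(4294967295u, buf), "4294967295") == 0);
}
```

The remainder is recovered with the same shift-and-add expression that divu10 uses internally, so no multiply is needed anywhere in the loop.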

美胚控场 2024-11-06 18:20:39

Of course you can, if you can live with some loss in precision. If you know the value range of your input values, you can come up with a bit shift and a multiplication that are exact over that range.
Some examples of how you can divide by 10, 60, ... are described in this blog, which uses them to format time as fast as possible.

temp = (ms * 205) >> 11;  // 205/2048 is nearly the same as /10
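
To make the range restriction concrete, here is a small sketch in C. The names div10_small and first_error are hypothetical; first_error simply brute-forces the smallest input where a given multiplier and shift disagree with true division.

```c
#include <assert.h>
#include <stdint.h>

/* Exact only for ms < 1029. */
uint32_t div10_small(uint32_t ms)
{
    return (ms * 205u) >> 11;
}

/* Smallest input below limit where (i * mul) >> shift differs from i / 10;
   returns limit if no such input exists. */
uint32_t first_error(uint32_t mul, unsigned shift, uint32_t limit)
{
    for (uint32_t i = 0; i < limit; ++i) {
        if ((uint32_t)(((uint64_t)i * mul) >> shift) != i / 10u)
            return i;
    }
    return limit;
}

/* Illustrative spot checks. */
void div10_small_check(void)
{
    assert(div10_small(1028u) == 102u);
    assert(div10_small(1029u) == 103u);  /* true quotient is 102 */
    assert(first_error(205u, 11, 100000u) == 1029u);
}
```

If the input is known to stay below 1029 (for example a millisecond field below 1000), the single multiply and shift is sufficient.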

夏尔 2024-11-06 18:20:39

to expand Alois's answer a bit, we can expand the suggested y = (x * 205) >> 11 for a few more multiples/shifts:

y = (ms *        1) >>  3 // first error 8
y = (ms *        2) >>  4 // 8
y = (ms *        4) >>  5 // 8
y = (ms *        7) >>  6 // 19
y = (ms *       13) >>  7 // 69
y = (ms *       26) >>  8 // 69
y = (ms *       52) >>  9 // 69
y = (ms *      103) >> 10 // 179
y = (ms *      205) >> 11 // 1029
y = (ms *      410) >> 12 // 1029
y = (ms *      820) >> 13 // 1029
y = (ms *     1639) >> 14 // 2739
y = (ms *     3277) >> 15 // 16389
y = (ms *     6554) >> 16 // 16389
y = (ms *    13108) >> 17 // 16389
y = (ms *    26215) >> 18 // 43699
y = (ms *    52429) >> 19 // 262149
y = (ms *   104858) >> 20 // 262149
y = (ms *   209716) >> 21 // 262149
y = (ms *   419431) >> 22 // 699059
y = (ms *   838861) >> 23 // 4194309
y = (ms *  1677722) >> 24 // 4194309
y = (ms *  3355444) >> 25 // 4194309
y = (ms *  6710887) >> 26 // 11184819
y = (ms * 13421773) >> 27 // 67108869

each line is a single, independent calculation, and you'll see your first "error"/incorrect result at the value shown in the comment. you're generally better off taking the smallest shift for a given error value, as this will minimise the extra bits needed to store the intermediate value in the calculation, e.g. (x * 13) >> 7 is "better" than (x * 52) >> 9 as it needs two fewer bits of overhead, while both start to give wrong answers above 68.

if you want to calculate more of these, the following (Python) code can be used:

def mul_from_shift(shift):
    mid = 2**shift + 5.
    return int(round(mid / 10.))

and I did the obvious thing for calculating when this approximation starts to go wrong with:

def first_err(mul, shift):
    i = 1
    while True:
        y = (i * mul) >> shift
        if y != i // 10:
            return i
        i += 1

(note that // is used for "integer" division, i.e. it floors the result; for the non-negative values used here that is the same as truncating towards zero)

the reason for the "3/1" pattern in errors (i.e. 8 repeats 3 times followed by 9) seems to be due to the change in bases, i.e. log2(10) is ~3.32. if we plot the errors we get a figure (not reproduced here) where the relative error is given by: mul_from_shift(shift) / (1<<shift) - 0.1
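
One way to see where the first error lands is the following derivation sketch. It is not from the original answer, and the symbols m, s, e, q and r are introduced here for illustration: m is the multiplier, s the shift, and the input is written as n = 10q + r with 0 <= r <= 9.

```latex
e = 10m - 2^{s} > 0, \qquad
\frac{n\,m}{2^{s}}
  = \frac{n}{10} + \frac{n\,e}{10 \cdot 2^{s}}
  = q + \frac{1}{10}\left( r + \frac{n\,e}{2^{s}} \right)

\left\lfloor \frac{n\,m}{2^{s}} \right\rfloor = q
  \iff r + \frac{n\,e}{2^{s}} < 10
```

For m = 205 and s = 11 this gives e = 2, so the result is exact for every n < 1024 and first fails at the next n ending in 9, which is 1029. Under this reading, the groups of three equal entries in the table arise because e runs through 2, 4, 8 while m doubles together with 2^s, which leaves 2^s / e unchanged; the fourth step (e = 6) is where m is no longer an exact doubling.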

沉睡月亮 2024-11-06 18:20:39

Considering Kuba Ober's response, there is another one in the same vein.
It uses iterative approximation of the result, but I wouldn't expect any surprising performance.

Let's say we have to find x where x = v / 10.

We'll use the inverse operation v = x * 10 because it has the nice property that when x = a + b, then x * 10 = a * 10 + b * 10.

Let's use x as the variable holding the best approximation of the result so far. When the search ends, x will hold the result. We'll try each bit b of x from the most significant to the least significant, one by one, and compare (x + b) * 10 with v. If it's smaller than or equal to v, then the bit b is set in x. To test the next bit, we simply shift b one position to the right (divide by two).

We can avoid the multiplication by 10 by holding x * 10 and b * 10 in other variables.

This yields the following algorithm to divide v by 10.

uint16_t x = 0, x10 = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
    uint32_t t = (uint32_t)x10 + b10;  // 32-bit sum: x10 + b10 can exceed 0xFFFF
    if (t <= v) {
        x10 = (uint16_t)t;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = v / 10

Edit: to get Kuba Ober's algorithm, which avoids the need for the variable x10, we can subtract b10 from v instead. In this case x10 isn't needed anymore. The algorithm becomes

uint16_t x = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
    if (b10 <= v) {
        v -= b10;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = (original v) / 10; v now holds the remainder

The loop may be unrolled and the different values of b and b10 may be precomputed as constants.
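
As a sketch, the subtracting variant can be wrapped into a complete function that also hands back the remainder left in v. The name div10_restoring and the optional remainder pointer are illustrative additions, not part of the original answer.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-by-bit division of a 16-bit value by 10 using only compares,
   subtracts and single-position shifts. */
uint16_t div10_restoring(uint16_t v, uint16_t *remainder)
{
    uint16_t x = 0, b = 0x1000, b10 = 0xA000;  /* b10 is always b * 10 */
    while (b != 0) {
        if (b10 <= v) {
            v -= b10;   /* consume b * 10 from the dividend */
            x |= b;     /* and record bit b in the quotient */
        }
        b10 >>= 1;
        b >>= 1;
    }
    if (remainder)
        *remainder = v; /* what is left over is v mod 10 */
    return x;
}

/* Illustrative spot checks. */
void div10_restoring_check(void)
{
    uint16_t r;
    assert(div10_restoring(65535u, &r) == 6553u && r == 5u);
    assert(div10_restoring(99u, &r) == 9u && r == 9u);
    assert(div10_restoring(0u, &r) == 0u && r == 0u);
}
```

Because b10 is only ever subtracted when it is no larger than v, no intermediate value leaves the 16-bit range.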

复古式 2024-11-06 18:20:39

On architectures that can only shift one place at a time, a series of explicit comparisons against decreasing powers of two multiplied by 10 might work better than the solution from Hacker's Delight. Assuming a 16-bit dividend:

uint16_t div10(uint16_t dividend) {
  uint16_t quotient = 0;
  #define div10_step(n) \
    do { if (dividend >= (n*10)) { quotient += n; dividend -= n*10; } } while (0)
  div10_step(0x1000);
  div10_step(0x0800);
  div10_step(0x0400);
  div10_step(0x0200);
  div10_step(0x0100);
  div10_step(0x0080);
  div10_step(0x0040);
  div10_step(0x0020);
  div10_step(0x0010);
  div10_step(0x0008);
  div10_step(0x0004);
  div10_step(0x0002);
  div10_step(0x0001);
  #undef div10_step
  if (dividend >= 5) ++quotient; // round the result (optional)
  return quotient;
}

偷得浮生 2024-11-06 18:20:39

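Well, division is subtraction, so yes. Shift right by 1 (divide by 2). Now subtract 5 from the result, counting the number of times you do the subtraction, until the value is less than 5. The result is the number of subtractions you did. Oh, and dividing is probably going to be faster.

A hybrid strategy of shifting right and then dividing by 5 using normal division might give you a performance improvement, if the logic in the divider doesn't already do this for you.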
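
A minimal sketch of the shift-then-subtract idea, with a hypothetical function name. It relies on floor(floor(n / 2) / 5) being equal to floor(n / 10), and its running time grows linearly with the quotient.

```c
#include <assert.h>

/* Halve once, then count how many times 5 fits into the result. */
unsigned div10_by_subtraction(unsigned n)
{
    unsigned half = n >> 1;
    unsigned count = 0;
    while (half >= 5u) {
        half -= 5u;
        ++count;
    }
    return count;
}

/* Illustrative spot checks. */
void div10_by_subtraction_check(void)
{
    assert(div10_by_subtraction(9u) == 0u);
    assert(div10_by_subtraction(10u) == 1u);
    assert(div10_by_subtraction(12345u) == 1234u);
}
```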

贱贱哒 2024-11-06 18:20:39

Based on realtime's answer, here's a python-based approach that supports infinite precision:

def bit_div10( n ):
    bl = n.bit_length()
    
    q = (n >> 1) + (n >> 2)
    
    i = 2
    while 1<<i < bl:
        q += (q >> (1<<i))
        i += 1
    
    q = q >> 3
    r = n - (((q << 2) + q) << 1)
    return q + (r > 9)

output:

>>> bit_div10( 1234567891000000 )
123456789100000
>>> bit_div10( 12345678901234567890123456789000000 )
1234567890123456789012345678900000
>>> bit_div10( 12345678901234567890123456789 )
1234567890123456789012345678

NOTE: This is not exactly fast, it just works with values that would break most answers here, including the referenced answer.

EDIT: Alternative approach based on John Källén's answer which is smaller and potentially faster than the previous code:

from math import log2

def div10( n ):
    l = 2 << int(log2( n.bit_length() ))
    return ( ((sum( 0x33 << (i<<3) for i in range(l >> 3) )>>1)^3) * n) >> l

The output is the same as the previous code.

To explain the code a bit, it works via replicating the pattern of invDivisor within the range of l (l = 8 << power based on divisor bit length).

Basically if n is 44 bits then l is 64 bits,
where invDivisor is calculated as ( 0x3333333333333333 >> 1 ) ^ 3 which results in 0x199999999999999A.

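It works surprisingly well at many bit depths, though it is still the rounded-up approximation: when n uses nearly all l bits the result can be one too high, for example n = 1073741829 (31 bits, so l is 32) gives 107374183.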

追星践月 2024-11-06 18:20:39

For unsigned bytes—numbers less than 256 or 2⁸:

uint8_t divideByTenUint8(uint8_t x) {
    unsigned t = 17 + (unsigned)x + ((unsigned)x << 4);
    return (uint8_t)((t + (t >> 1)) >> 8);
}

For unsigned numbers less than 1029 (2¹⁰+5) (adapted from @AloisKraus's answer):

unsigned divideByTenLt1029(unsigned x) {
    uint_least32_t t = x; // in case of 16-bit ints
    uint_least32_t u = t + (t << 4);
    return (unsigned)((t + (u << 2) + (u << 3)) >> 11);
}

For numbers less than 65536 (2¹⁶):

uint16_t divideByTenUint16(uint16_t x) {
    uint_least32_t t = 257 + (uint_least32_t)x + ((uint_least32_t)x << 8);
    t += t << 4;
    return (uint16_t)((t + (t >> 1)) >> 16);
}

For numbers less than 4294967296 (2³²):

uint32_t divideByTenUint32(uint32_t x) {
    uint_least64_t t = x;
    t += 3 + (t << 4) + (t << 8) + (t << 12);
    return (uint32_t)((t + (t << 1) + (t << 16) + (t << 17)) >> 33);
}

For all 64-bit numbers (requires GCC extensions):

uint64_t divideByTenUint64(uint64_t x) {
    __uint128_t t = 1 + (__uint128_t)x;
    t+=t<<32, t+=t<<16, t+=t<<8, t+=t<<4;
    return (uint64_t)((t + (t >> 1)) >> 64);
}

沦落红尘 2024-11-06 18:20:39

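I designed a new method in AVR assembler, using only lsr/ror and sub/sbc. It divides by 8, then subtracts the number divided by 64 and by 128, then subtracts the 1,024th and the 2,048th, and so on. It works very reliably (including exact rounding) and quickly (370 microseconds at 1 MHz).
The source code for 16-bit numbers is here:
http://www.avr-asm-tutorial.net/avr_cn/beginner/DIV10/div10_16rd.asm
The page that comments this source code is here:
http://www.avr-asm-tutorial.net/avr_cn/beginner/DIV10/DIV10.html
I hope it helps, even though the question is ten years old.
BRGS, GSC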

荭秂 2024-11-06 18:20:39

The code from elemakil's comment can be found here: https://doc.lagout.org/security/Hackers%20Delight.pdf
Page 233, "Unsigned divide by 10 [and 11.]"
