有效地实现地板/欧几里得整数除法

发布于 2024-09-30 16:26:22 字数 178 浏览 3 评论 0原文

向下除法是指结果始终向下取整(朝向 −∞),而不是朝向 0:

division types

是否可以在 C/C++ 中有效实现地板或欧几里得整数除法?

(显而易见的解决方案是检查股息的符号)

Floored division is when the result is always floored down (towards −∞), not towards 0:

division types

Is it possible to efficiently implement floored or euclidean integer division in C/C++?

(the obvious solution is to check the dividend's sign)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

燃情 2024-10-07 16:26:22

我编写了一个测试程序来对此处提出的想法进行基准测试:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>

#define N 10000000
#define M 100

int dividends[N], divisors[N], results[N];

__forceinline int floordiv_signcheck(int a, int b)
{
    return (a<0 ? a-(b-1) : a) / b;
}

__forceinline int floordiv_signcheck2(int a, int b)
{
    return (a - (a<0 ? b-1 : 0)) / b;
}

__forceinline int floordiv_signmultiply(int a, int b)
{
    return (a + (a>>(sizeof(a)*8-1))*(b-1)) / b;
}

__forceinline int floordiv_floatingpoint(int a, int b)
{
    // I imagine that the call to floor can be replaced to a cast
    // if you can get FPU rounding control to work (I couldn't).
    return floor((double)a / b);
}

void main()
{
    for (int i=0; i<N; i++)
    {
        dividends[i] = rand();
        do
            divisors[i] = rand();
        while (divisors[i]==0);
    }

    LARGE_INTEGER t0, t1;

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signcheck(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signcheck    : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signcheck2(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signcheck2   : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signmultiply(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signmultiply : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_floatingpoint(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("floatingpoint: %9llu\n", t1.QuadPart-t0.QuadPart);
}

结果:

signcheck    :  61458768
signcheck2   :  61284370
signmultiply :  61625076
floatingpoint: 287315364

因此,根据我的结果,检查符号是最快的:

(a - (a<0 ? b-1 : 0)) / b

I've written a test program to benchmark the ideas presented here:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>

#define N 10000000
#define M 100

int dividends[N], divisors[N], results[N];

__forceinline int floordiv_signcheck(int a, int b)
{
    return (a<0 ? a-(b-1) : a) / b;
}

__forceinline int floordiv_signcheck2(int a, int b)
{
    return (a - (a<0 ? b-1 : 0)) / b;
}

__forceinline int floordiv_signmultiply(int a, int b)
{
    return (a + (a>>(sizeof(a)*8-1))*(b-1)) / b;
}

__forceinline int floordiv_floatingpoint(int a, int b)
{
    // I imagine that the call to floor can be replaced to a cast
    // if you can get FPU rounding control to work (I couldn't).
    return floor((double)a / b);
}

void main()
{
    for (int i=0; i<N; i++)
    {
        dividends[i] = rand();
        do
            divisors[i] = rand();
        while (divisors[i]==0);
    }

    LARGE_INTEGER t0, t1;

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signcheck(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signcheck    : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signcheck2(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signcheck2   : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signmultiply(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signmultiply : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_floatingpoint(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("floatingpoint: %9llu\n", t1.QuadPart-t0.QuadPart);
}

Results:

signcheck    :  61458768
signcheck2   :  61284370
signmultiply :  61625076
floatingpoint: 287315364

So, according to my results, checking the sign is the fastest:

(a - (a<0 ? b-1 : 0)) / b
计㈡愣 2024-10-07 16:26:22

五年后我重新审视这个问题,因为这也与我相关。我对 x86-64 的两个纯 C 版本和两个内联汇编版本进行了一些性能测量,结果可能很有趣。

地板划分的测试变体是:

  1. 我已经使用了一段时间的实现;
  2. 与上面介绍的略有不同,仅使用一个分区;
  3. 前一个,但是在内联汇编中手动实现;以及
  4. 在汇编中实现的 CMOV 版本。

以下是我的基准测试程序:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#ifndef VARIANT
#define VARIANT 3
#endif

#if VARIANT == 0
#define floordiv(a, b) (((a) < 0)?((((a) + 1) / (b)) - 1):((a) / (b)))
#elif VARIANT == 1
#define floordiv(a, b) ((((a) < 0)?((a) - ((b) - 1)):(a)) / (b))
#elif VARIANT == 2
#define floordiv(a, b) ({                                   \
    int result;                                             \
    asm("test %%eax, %%eax; jns 1f; sub %1, %%eax;"         \
        "add $1, %%eax; 1: cltd; idivl %1;"                 \
        : "=a" (result)                                     \
        : "r" (b),                                          \
          "0" (a)                                           \
        : "rdx");                                           \
    result;})
#elif VARIANT == 3
#define floordiv(a, b) ({                                           \
    int result;                                                     \
    asm("mov %%eax, %%edx; sub %1, %%edx; add $1, %%edx;"           \
        "test %%eax, %%eax; cmovs %%edx, %%eax; cltd;"              \
        "idivl %1;"                                                 \
        : "=a" (result)                                             \
        : "r" (b),                                                  \
          "0" (a)                                                   \
        : "rdx");                                                   \
    result;})
#endif

double ntime(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return(tv.tv_sec + (((double)tv.tv_usec) / 1000000.0));
}

void timediv(int n, int *p, int *q, int *r)
{
    int i;

    for(i = 0; i < n; i++)
        r[i] = floordiv(p[i], q[i]);
}

int main(int argc, char **argv)
{
    int n, i, *q, *p, *r;
    double st;

    n = 10000000;
    p = malloc(sizeof(*p) * n);
    q = malloc(sizeof(*q) * n);
    r = malloc(sizeof(*r) * n);
    for(i = 0; i < n; i++) {
        p[i] = (rand() % 1000000) - 500000;
        q[i] = (rand() % 1000000) + 1;
    }

    st = ntime();
    for(i = 0; i < 100; i++)
        timediv(n, p, q, r);
    printf("%g\n", ntime() - st);
    return(0);
}

我使用 GCC 4.9.2 使用 gcc -march=native -Ofast 编译了该程序,在我的 Core i5-2400 上的结果如下。每次运行的结果都相当可重复——至少它们总是以相同的顺序着陆。

  • 变体 0:7.21 秒
  • 变体 1:7.26 秒
  • 变体 2:6.73 秒
  • 变体 3:4.32 秒

因此,CMOV 实现至少胜过其他实现。让我惊讶的是,变体 2 远远超过了它的纯 C 版本(变体 1)。我认为编译器应该能够发出至少与我的代码一样高效的代码。

以下是一些其他平台,用于比较:

AMD Athlon 64 X2 4200+、GCC 4.7.2:

  • 变体 0:26.33 秒
  • 变体 1:25.38 秒
  • 变体 2:25.19 秒
  • 变体 3:22.39 秒

Xeon E3-1271 v3、GCC 4.9。 2:

  • 变体 0:5.95 秒
  • 变体 1:5.62 秒
  • 变体 2:5.40 秒
  • 变体 3:3.44 秒

最后一点,我也许应该警告不要过于认真地对待 CMOV 版本的明显性能优势,因为在现实世界中,其他版本中的分支可能不会像此基准测试中那样完全随机,并且如果分支预测器可以完成合理的工作,则分支版本可能会更好。然而,现实情况在很大程度上取决于实践中使用的数据,因此尝试进行任何通用基准测试可能毫无意义。

I'm revisiting this question five years later, as this is relevant for me too. I did some performance measurements on two pure-C versions and two inline-assembly versions for x86-64, and the results may be interesting.

The tested variants of floored division are:

  1. The implementation I've been using for some time now;
  2. The slight variant on that presented above which only uses one division;
  3. The previous one, but hand-implemented in inline-assembly; and
  4. A CMOV version implemented in assembly.

The following is my benchmark program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#ifndef VARIANT
#define VARIANT 3
#endif

#if VARIANT == 0
#define floordiv(a, b) (((a) < 0)?((((a) + 1) / (b)) - 1):((a) / (b)))
#elif VARIANT == 1
#define floordiv(a, b) ((((a) < 0)?((a) - ((b) - 1)):(a)) / (b))
#elif VARIANT == 2
#define floordiv(a, b) ({                                   \
    int result;                                             \
    asm("test %%eax, %%eax; jns 1f; sub %1, %%eax;"         \
        "add $1, %%eax; 1: cltd; idivl %1;"                 \
        : "=a" (result)                                     \
        : "r" (b),                                          \
          "0" (a)                                           \
        : "rdx");                                           \
    result;})
#elif VARIANT == 3
#define floordiv(a, b) ({                                           \
    int result;                                                     \
    asm("mov %%eax, %%edx; sub %1, %%edx; add $1, %%edx;"           \
        "test %%eax, %%eax; cmovs %%edx, %%eax; cltd;"              \
        "idivl %1;"                                                 \
        : "=a" (result)                                             \
        : "r" (b),                                                  \
          "0" (a)                                                   \
        : "rdx");                                                   \
    result;})
#endif

double ntime(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return(tv.tv_sec + (((double)tv.tv_usec) / 1000000.0));
}

void timediv(int n, int *p, int *q, int *r)
{
    int i;

    for(i = 0; i < n; i++)
        r[i] = floordiv(p[i], q[i]);
}

int main(int argc, char **argv)
{
    int n, i, *q, *p, *r;
    double st;

    n = 10000000;
    p = malloc(sizeof(*p) * n);
    q = malloc(sizeof(*q) * n);
    r = malloc(sizeof(*r) * n);
    for(i = 0; i < n; i++) {
        p[i] = (rand() % 1000000) - 500000;
        q[i] = (rand() % 1000000) + 1;
    }

    st = ntime();
    for(i = 0; i < 100; i++)
        timediv(n, p, q, r);
    printf("%g\n", ntime() - st);
    return(0);
}

I compiled this with gcc -march=native -Ofast using GCC 4.9.2, and the results, on my Core i5-2400, were as follows. The results are fairly reproducible from run to run -- they always land in the same order, at least.

  • Variant 0: 7.21 seconds
  • Variant 1: 7.26 seconds
  • Variant 2: 6.73 seconds
  • Variant 3: 4.32 seconds

So the CMOV implementation blows the others out of the water, at least. What surprises me is that variant 2 out-does its pure-C version (variant 1) by a fairly wide margin. I'd have thought the compiler should be able to emit code at least as efficient as mine.

Here are some other platforms, for comparison:

AMD Athlon 64 X2 4200+, GCC 4.7.2:

  • Variant 0: 26.33 seconds
  • Variant 1: 25.38 seconds
  • Variant 2: 25.19 seconds
  • Variant 3: 22.39 seconds

Xeon E3-1271 v3, GCC 4.9.2:

  • Variant 0: 5.95 seconds
  • Variant 1: 5.62 seconds
  • Variant 2: 5.40 seconds
  • Variant 3: 3.44 seconds

As a final note, I should perhaps warn against taking the apparent performance advantage of the CMOV version too seriously, because in the real world, the branch in the other versions will probably not be as completely random as in this benchmark, and if the branch predictor can do a reasonable job, the branching versions may turn out to be better. However, the realities of that will depend quite a bit on the data that are being used in practice, and so is probably pointless to try and do any generic benchmark of.

睫毛上残留的泪 2024-10-07 16:26:22

提出一些无分支的东西来根据符号纠正结果可能会更有效,因为分支的成本很高。

请参阅 第 2 章 的第 20 页。 /" rel="nofollow">黑客之乐 了解如何访问该标志。

It could be more efficient to come up with something branch free to correct the result based on the sign, as branches are expensive.

See page 20ff of Chapter 2 in Hacker's Delight on how to access the sign.

慈悲佛祖 2024-10-07 16:26:22

请注意:x86 sar 指令在涉及 2 的幂时执行向下除法。

Just a note: the x86 sar instruction performs floored division when it comes to powers of two.

海拔太高太耀眼 2024-10-07 16:26:22

由于 IEEE-754 指定向 -inf 舍入作为所需舍入模式之一,我想您问题的答案是肯定的。但也许您可以解释一下您是否想知道如果编写编译器,将如何实现该过程,或者想知道如何使用特定编译器来执行该操作?

Since IEEE-754 specifies round towards -inf as one of the required rounding modes I imagine that the answer to your question is yes. But perhaps you can explain whether you want to know how one would implement the procedure if one were writing the compiler, or to know how to use a particular compiler to perform the operation ?

洛阳烟雨空心柳 2024-10-07 16:26:22

是否可以在 C/C++ 中有效地实现取整除法或欧几里得整数除法?

是的。

(显而易见的解决方案是检查被除数的符号)

我完全同意,并且很难相信存在更快的替代方案。

Is it possible to efficiently implement floored or euclidian integer division in C/C++?

Yes.

(the obvious solution is to check the dividend's sign)

I agree completely, and would find it hard to believe there exists an alternative that is significantly faster.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文