i=(i+1)&3 比 i=(i+1)%4 快吗

发布于 2024-12-19 18:59:36 字数 200 浏览 2 评论 0原文

我正在优化 C++ 代码。 在一个关键步骤中,我想实现以下函数y=f(x)

f(0)=1

f(1)=2

f(2)=3

f(3)=0

哪个更快?使用查找表或 i=(i+1)&3i=(i+1)%4 ?或者有更好的建议吗?

I am optimizing a c++ code.
at one critical step, I want to implement the following function y=f(x):

f(0)=1

f(1)=2

f(2)=3

f(3)=0

which one is faster ? using a lookup table or i=(i+1)&3 or i=(i+1)%4 ? or any better suggestion?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

只等公子 2024-12-26 18:59:36

几乎可以肯定,查找表将是最慢的。在很多情况下,编译器将为 (i+1)&3(i+1)%4 生成相同的程序集;然而,根据 i 的类型/符号,它们可能并不严格等效,并且编译器将无法进行优化。例如,对于我系统上的代码

int foo(int i)
{
    return (i+1)%4;
}

unsigned bar(unsigned i)
{
    return (i+1)%4;
}

gcc -O2 生成:

0000000000000000 <foo>:
   0:   8d 47 01                lea    0x1(%rdi),%eax
   3:   89 c2                   mov    %eax,%edx
   5:   c1 fa 1f                sar    $0x1f,%edx
   8:   c1 ea 1e                shr    $0x1e,%edx
   b:   01 d0                   add    %edx,%eax
   d:   83 e0 03                and    $0x3,%eax
  10:   29 d0                   sub    %edx,%eax
  12:   c3                      retq   

0000000000000020 <bar>:
  20:   8d 47 01                lea    0x1(%rdi),%eax
  23:   83 e0 03                and    $0x3,%eax
  26:   c3                      retq

因此,正如您所看到的,由于有关带符号模数结果的规则,(i+1)%4 生成首先有更多的代码。

最重要的是,如果 (i+1)&3 版本表达了您想要的内容,您可能最好使用它,因为编译器执行您不​​期望的操作的机会较小。

Almost certainly the lookup table is going to be slowest. In a lot of cases, the compiler will generate the same assembly for (i+1)&3 and (i+1)%4; however depending on the type/signedness of i, they may not be strictly equivalent and the compiler won't be able to make that optimization. For example for the code

int foo(int i)
{
    return (i+1)%4;
}

unsigned bar(unsigned i)
{
    return (i+1)%4;
}

on my system, gcc -O2 generates:

0000000000000000 <foo>:
   0:   8d 47 01                lea    0x1(%rdi),%eax
   3:   89 c2                   mov    %eax,%edx
   5:   c1 fa 1f                sar    $0x1f,%edx
   8:   c1 ea 1e                shr    $0x1e,%edx
   b:   01 d0                   add    %edx,%eax
   d:   83 e0 03                and    $0x3,%eax
  10:   29 d0                   sub    %edx,%eax
  12:   c3                      retq   

0000000000000020 <bar>:
  20:   8d 47 01                lea    0x1(%rdi),%eax
  23:   83 e0 03                and    $0x3,%eax
  26:   c3                      retq

so as you can see because of the rules about signed modulus results, (i+1)%4 generates a lot more code in the first place.

Bottom line, you're probably best off using the (i+1)&3 version if that expresses what you want, because there's less chance for the compiler to do something you don't expect.

茶色山野 2024-12-26 18:59:36

我不会讨论过早优化。但答案是它们的速度相同。

任何理智的编译器都会将它们编译成相同的东西。无论如何,除以 2 的幂的除法/模数将被优化为按位运算。

因此,请使用您发现的(或其他人会发现的)更具可读性的内容。

编辑:正如罗兰所指出的,它有时会根据符号的不同而表现不同:

无符号&:

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

无符号模数:

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

有符号&:

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

有符号模数:

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, -2147483645            ; 80000003H
jns SHORT $LN3@main
dec eax
or  eax, -4                 ; fffffffcH

I won't get into the discussion of premature optimization. But the answer is that they will be the same speed.

Any sane compiler will compile them to the same thing. Division/modulus by a power of two will be optimized to bitwise operations anyway.

So use whichever you find (or others will find) to be more readable.

EDIT : As Roland has pointed out, it does sometimes behave different depending on the signness:

Unsigned &:

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

Unsigned Modulus:

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

Signed &:

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

Signed Modulus:

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, -2147483645            ; 80000003H
jns SHORT $LN3@main
dec eax
or  eax, -4                 ; fffffffcH
烟沫凡尘 2024-12-26 18:59:36

很有可能,您不会发现任何差异:任何相当现代的编译器都知道将两者优化为相同的代码。

Good chances are, you wouldn't find any differences: any reasonably modern compiler knows to optimize both into the same code.

凌乱心跳 2024-12-26 18:59:36

您尝试过对其进行基准测试吗?作为一个临时猜测,我假设 &3 版本会更快,因为这是一个简单的加法和按位 AND 运算,这两者都应该是任何现代 ish 上的单周期运算中央处理器。

%4 可以采用几种不同的方式,具体取决于编译器的智能程度。它可以通过除法来完成,这比加法慢得多,或者它也可以转换为按位 and 运算,最终与 &3 一样快代码>版本。

Have you tried benchmarking it? As an offhand gues, I'll assume that the &3 version will be faster, as that's a simple addition and bitwise AND operation, both of which should be single-cycle operations on any modern-ish CPU.

The %4 could go a few different ways, depending on how smart the compiler is. it could be done via division, which is much slower than addition, or it could be translated into a bitwise and operation as well and end up being just as fast as the &3 version.

つ低調成傷 2024-12-26 18:59:36

与 Mystical 相同,但 C 和 ARM

int fun1 ( int i )
{
    return( (i+1)&3 );
}

int fun2 ( int i )
{
    return( (i+1)%4 );
}

unsigned int fun3 ( unsigned int i )
{
    return( (i+1)&3 );
}

unsigned int fun4 ( unsigned int i )
{
    return( (i+1)%4 );
}

创建:

00000000 <fun1>:
   0:   e2800001    add r0, r0, #1
   4:   e2000003    and r0, r0, #3
   8:   e12fff1e    bx  lr

0000000c <fun2>:
   c:   e2802001    add r2, r0, #1
  10:   e1a0cfc2    asr ip, r2, #31
  14:   e1a03f2c    lsr r3, ip, #30
  18:   e0821003    add r1, r2, r3
  1c:   e2010003    and r0, r1, #3
  20:   e0630000    rsb r0, r3, r0
  24:   e12fff1e    bx  lr

00000028 <fun3>:
  28:   e2800001    add r0, r0, #1
  2c:   e2000003    and r0, r0, #3
  30:   e12fff1e    bx  lr

00000034 <fun4>:
  34:   e2800001    add r0, r0, #1
  38:   e2000003    and r0, r0, #3
      3c:   e12fff1e    bx  lr

对于负数,掩码和模数不相等,仅对于正数/无符号数。对于这些情况,您的编译器应该知道 %4 与 &3 相同,并在 (&3) 上使用较便宜的(与上面的 gcc 一样)。下面是 clang/llc

fun3:                             
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr

fun4:
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr

same as Mystical but C and ARM

int fun1 ( int i )
{
    return( (i+1)&3 );
}

int fun2 ( int i )
{
    return( (i+1)%4 );
}

unsigned int fun3 ( unsigned int i )
{
    return( (i+1)&3 );
}

unsigned int fun4 ( unsigned int i )
{
    return( (i+1)%4 );
}

creates:

00000000 <fun1>:
   0:   e2800001    add r0, r0, #1
   4:   e2000003    and r0, r0, #3
   8:   e12fff1e    bx  lr

0000000c <fun2>:
   c:   e2802001    add r2, r0, #1
  10:   e1a0cfc2    asr ip, r2, #31
  14:   e1a03f2c    lsr r3, ip, #30
  18:   e0821003    add r1, r2, r3
  1c:   e2010003    and r0, r1, #3
  20:   e0630000    rsb r0, r3, r0
  24:   e12fff1e    bx  lr

00000028 <fun3>:
  28:   e2800001    add r0, r0, #1
  2c:   e2000003    and r0, r0, #3
  30:   e12fff1e    bx  lr

00000034 <fun4>:
  34:   e2800001    add r0, r0, #1
  38:   e2000003    and r0, r0, #3
      3c:   e12fff1e    bx  lr

For negative numbers the mask and the modulo are not equivalent, only for positive/unsigned numbers. For those cases your compiler should know that %4 is the same as &3 and use the less expensive on (&3) as gcc above. clang/llc below

fun3:                             
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr

fun4:
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr
素衣风尘叹 2024-12-26 18:59:36

当然&比 % 更快。以前的许多帖子都证明了这一点。另外,由于 i 是局部变量,因此您可以使用 ++i 而不是 i+1,因为大多数编译器都可以更好地实现它。 i+1 可能(不)被优化为 ++i。

更新:也许我不清楚,我的意思是,该函数应该只是“return((++i)&3);”

Ofcourse & is faster then %. Which is proven by many previous posts. Also as i is local variable, u can use ++i instead of i+1, as it is better implemented by most of the compilers. i+1 may(not) be optimized as ++i.

UPDATE: Perhaps i was not clear, i meant, the function should just "return((++i)&3);"

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文