i=(i+1)&3 比 i=(i+1)%4 快吗

发布于 2024-12-19 18:59:36 字数 200 浏览 2 评论 0原文

我正在优化 C++ 代码。在一个关键步骤中，我想实现以下函数y=f(x)：

f(0)=1

f(1)=2

f(2)=3

f(3)=0

哪个更快？使用查找表或 i=(i+1)&3 或 i=(i+1)%4 ？或者有更好的建议吗？

原文

I am optimizing a c++ code.
at one critical step, I want to implement the following function y=f(x):

f(0)=1

f(1)=2

f(2)=3

f(3)=0

which one is faster ? using a lookup table or i=(i+1)&3 or i=(i+1)%4 ? or any better suggestion?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

只等公子 2024-12-26 18:59:36

几乎可以肯定，查找表将是最慢的。在很多情况下，编译器将为 (i+1)&3 和 (i+1)%4 生成相同的程序集；然而，根据 i 的类型/符号，它们可能并不严格等效，并且编译器将无法进行优化。例如，对于我系统上的代码

int foo(int i)
{
    return (i+1)%4;
}

unsigned bar(unsigned i)
{
    return (i+1)%4;
}

，gcc -O2 生成：

0000000000000000 <foo>:
   0:   8d 47 01                lea    0x1(%rdi),%eax
   3:   89 c2                   mov    %eax,%edx
   5:   c1 fa 1f                sar    $0x1f,%edx
   8:   c1 ea 1e                shr    $0x1e,%edx
   b:   01 d0                   add    %edx,%eax
   d:   83 e0 03                and    $0x3,%eax
  10:   29 d0                   sub    %edx,%eax
  12:   c3                      retq   

0000000000000020 <bar>:
  20:   8d 47 01                lea    0x1(%rdi),%eax
  23:   83 e0 03                and    $0x3,%eax
  26:   c3                      retq

因此，正如您所看到的，由于有关带符号模数结果的规则，(i+1)%4 生成首先有更多的代码。

最重要的是，如果 (i+1)&3 版本表达了您想要的内容，您可能最好使用它，因为编译器执行您不期望的操作的机会较小。

Almost certainly the lookup table is going to be slowest. In a lot of cases, the compiler will generate the same assembly for (i+1)&3 and (i+1)%4; however depending on the type/signedness of i, they may not be strictly equivalent and the compiler won't be able to make that optimization. For example for the code

int foo(int i)
{
    return (i+1)%4;
}

unsigned bar(unsigned i)
{
    return (i+1)%4;
}

on my system, gcc -O2 generates:

0000000000000000 <foo>:
   0:   8d 47 01                lea    0x1(%rdi),%eax
   3:   89 c2                   mov    %eax,%edx
   5:   c1 fa 1f                sar    $0x1f,%edx
   8:   c1 ea 1e                shr    $0x1e,%edx
   b:   01 d0                   add    %edx,%eax
   d:   83 e0 03                and    $0x3,%eax
  10:   29 d0                   sub    %edx,%eax
  12:   c3                      retq   

0000000000000020 <bar>:
  20:   8d 47 01                lea    0x1(%rdi),%eax
  23:   83 e0 03                and    $0x3,%eax
  26:   c3                      retq

so as you can see because of the rules about signed modulus results, (i+1)%4 generates a lot more code in the first place.

Bottom line, you're probably best off using the (i+1)&3 version if that expresses what you want, because there's less chance for the compiler to do something you don't expect.

回复收藏 0 原文

茶色山野 2024-12-26 18:59:36

我不会讨论过早优化。但答案是它们的速度相同。

任何理智的编译器都会将它们编译成相同的东西。无论如何，除以 2 的幂的除法/模数将被优化为按位运算。

因此，请使用您发现的（或其他人会发现的）更具可读性的内容。

编辑：正如罗兰所指出的，它有时会根据符号的不同而表现不同：

无符号&：

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

无符号模数：

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

有符号&：

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

有符号模数：

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, -2147483645            ; 80000003H
jns SHORT $LN3@main
dec eax
or  eax, -4                 ; fffffffcH

I won't get into the discussion of premature optimization. But the answer is that they will be the same speed.

Any sane compiler will compile them to the same thing. Division/modulus by a power of two will be optimized to bitwise operations anyway.

So use whichever you find (or others will find) to be more readable.

EDIT : As Roland has pointed out, it does sometimes behave different depending on the signness:

Unsigned &:

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

Unsigned Modulus:

int main(void)
{
    unsigned x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

Signed &:

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) & 3;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push    eax

Signed Modulus:

int main(void)
{
    int x;
    cin >> x;
    x = (x + 1) % 4;
    cout << x;

    return 0;
}

mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, -2147483645            ; 80000003H
jns SHORT $LN3@main
dec eax
or  eax, -4                 ; fffffffcH

回复收藏 0 原文

烟沫凡尘 2024-12-26 18:59:36

很有可能，您不会发现任何差异：任何相当现代的编译器都知道将两者优化为相同的代码。

回复收藏 0 原文

凌乱心跳 2024-12-26 18:59:36

您尝试过对其进行基准测试吗？作为一个临时猜测，我假设 &3 版本会更快，因为这是一个简单的加法和按位 AND 运算，这两者都应该是任何现代 ish 上的单周期运算中央处理器。

%4 可以采用几种不同的方式，具体取决于编译器的智能程度。它可以通过除法来完成，这比加法慢得多，或者它也可以转换为按位 and 运算，最终与 &3 一样快代码>版本。

回复收藏 0 原文

つ低調成傷 2024-12-26 18:59:36

与 Mystical 相同，但 C 和 ARM

int fun1 ( int i )
{
    return( (i+1)&3 );
}

int fun2 ( int i )
{
    return( (i+1)%4 );
}

unsigned int fun3 ( unsigned int i )
{
    return( (i+1)&3 );
}

unsigned int fun4 ( unsigned int i )
{
    return( (i+1)%4 );
}

创建：

00000000 <fun1>:
   0:   e2800001    add r0, r0, #1
   4:   e2000003    and r0, r0, #3
   8:   e12fff1e    bx  lr

0000000c <fun2>:
   c:   e2802001    add r2, r0, #1
  10:   e1a0cfc2    asr ip, r2, #31
  14:   e1a03f2c    lsr r3, ip, #30
  18:   e0821003    add r1, r2, r3
  1c:   e2010003    and r0, r1, #3
  20:   e0630000    rsb r0, r3, r0
  24:   e12fff1e    bx  lr

00000028 <fun3>:
  28:   e2800001    add r0, r0, #1
  2c:   e2000003    and r0, r0, #3
  30:   e12fff1e    bx  lr

00000034 <fun4>:
  34:   e2800001    add r0, r0, #1
  38:   e2000003    and r0, r0, #3
      3c:   e12fff1e    bx  lr

对于负数，掩码和模数不相等，仅对于正数/无符号数。对于这些情况，您的编译器应该知道 %4 与 &3 相同，并在 (&3) 上使用较便宜的（与上面的 gcc 一样）。下面是 clang/llc

fun3:                             
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr

fun4:
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr

same as Mystical but C and ARM

int fun1 ( int i )
{
    return( (i+1)&3 );
}

int fun2 ( int i )
{
    return( (i+1)%4 );
}

unsigned int fun3 ( unsigned int i )
{
    return( (i+1)&3 );
}

unsigned int fun4 ( unsigned int i )
{
    return( (i+1)%4 );
}

creates:

00000000 <fun1>:
   0:   e2800001    add r0, r0, #1
   4:   e2000003    and r0, r0, #3
   8:   e12fff1e    bx  lr

0000000c <fun2>:
   c:   e2802001    add r2, r0, #1
  10:   e1a0cfc2    asr ip, r2, #31
  14:   e1a03f2c    lsr r3, ip, #30
  18:   e0821003    add r1, r2, r3
  1c:   e2010003    and r0, r1, #3
  20:   e0630000    rsb r0, r3, r0
  24:   e12fff1e    bx  lr

00000028 <fun3>:
  28:   e2800001    add r0, r0, #1
  2c:   e2000003    and r0, r0, #3
  30:   e12fff1e    bx  lr

00000034 <fun4>:
  34:   e2800001    add r0, r0, #1
  38:   e2000003    and r0, r0, #3
      3c:   e12fff1e    bx  lr

For negative numbers the mask and the modulo are not equivalent, only for positive/unsigned numbers. For those cases your compiler should know that %4 is the same as &3 and use the less expensive on (&3) as gcc above. clang/llc below

fun3:                             
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr

fun4:
    add r0, r0, #1
    and r0, r0, #3
    mov pc, lr

回复收藏 0 原文

素衣风尘叹 2024-12-26 18:59:36

当然&比 % 更快。以前的许多帖子都证明了这一点。另外，由于 i 是局部变量，因此您可以使用 ++i 而不是 i+1，因为大多数编译器都可以更好地实现它。 i+1 可能（不）被优化为 ++i。

更新：也许我不清楚，我的意思是，该函数应该只是“return((++i)&3);”

回复收藏 0 原文

~没有更多了~

关于作者

软糖

暂无简介

文章

26 人气

关注发私信

友情链接

文江博客

i=(i+1)&3 比 i=(i+1)%4 快吗

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

Promise

qq_lbRlsh

待＂谢繁草

yy2010hell

漫无边际

傲娇萝莉攻

友情链接

i=(i+1)&3 比 i=(i+1)%4 快吗

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

Promise

qq_lbRlsh

待＂谢繁草

yy2010hell

漫无边际

傲娇萝莉攻

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。