是“开关”吗?比“if”更快?

发布于 2024-11-25 22:30:37 字数 5361 浏览 3 评论 0原文

switch 语句实际上if 语句更快吗?

我在 Visual Studio 2010 的 x64 C++ 编译器上使用 /Ox 标志运行以下代码:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 29)
size_t counter = 0;

size_t testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        switch (counter % 4 + 1)
        {
            case 1: counter += 4; break;
            case 2: counter += 3; break;
            case 3: counter += 2; break;
            case 4: counter += 1; break;
        }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

size_t testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = counter % 4 + 1;
        if (c == 1) { counter += 4; }
        else if (c == 2) { counter += 3; }
        else if (c == 3) { counter += 2; }
        else if (c == 4) { counter += 1; }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    printf("Starting...\n");
    printf("Switch statement: %u ms\n", testSwitch());
    printf("If     statement: %u ms\n", testIf());
}

并得到以下结果:

Switch 语句:5261 毫秒
If 语句:5196 毫秒

根据我的了解,switch 语句显然使用跳转表来优化分支。

问题:

  1. 在 x86 或 x64 中,基本跳转表是什么样子?

  2. 此代码是否使用跳转表?

  3. 为什么这个例子中没有性能差异?是否存在显着性能差异的情况?


代码反汇编:

testIf:

13FE81B10 sub  rsp,48h 
13FE81B14 call qword ptr [__imp_clock (13FE81128h)] 
13FE81B1A mov  dword ptr [start],eax 
13FE81B1E mov  qword ptr [i],0 
13FE81B27 jmp  testIf+26h (13FE81B36h) 
13FE81B29 mov  rax,qword ptr [i] 
13FE81B2E inc  rax  
13FE81B31 mov  qword ptr [i],rax 
13FE81B36 cmp  qword ptr [i],20000000h 
13FE81B3F jae  testIf+0C3h (13FE81BD3h) 
13FE81B45 xor  edx,edx 
13FE81B47 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B4E mov  ecx,4 
13FE81B53 div  rax,rcx 
13FE81B56 mov  rax,rdx 
13FE81B59 inc  rax  
13FE81B5C mov  qword ptr [c],rax 
13FE81B61 cmp  qword ptr [c],1 
13FE81B67 jne  testIf+6Dh (13FE81B7Dh) 
13FE81B69 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B70 add  rax,4 
13FE81B74 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B7B jmp  testIf+0BEh (13FE81BCEh) 
13FE81B7D cmp  qword ptr [c],2 
13FE81B83 jne  testIf+89h (13FE81B99h) 
13FE81B85 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B8C add  rax,3 
13FE81B90 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B97 jmp  testIf+0BEh (13FE81BCEh) 
13FE81B99 cmp  qword ptr [c],3 
13FE81B9F jne  testIf+0A5h (13FE81BB5h) 
13FE81BA1 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BA8 add  rax,2 
13FE81BAC mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BB3 jmp  testIf+0BEh (13FE81BCEh) 
13FE81BB5 cmp  qword ptr [c],4 
13FE81BBB jne  testIf+0BEh (13FE81BCEh) 
13FE81BBD mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BC4 inc  rax  
13FE81BC7 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BCE jmp  testIf+19h (13FE81B29h) 
13FE81BD3 call qword ptr [__imp_clock (13FE81128h)] 
13FE81BD9 sub  eax,dword ptr [start] 
13FE81BDD imul eax,eax,3E8h 
13FE81BE3 cdq       
13FE81BE4 mov  ecx,3E8h 
13FE81BE9 idiv eax,ecx 
13FE81BEB cdqe      
13FE81BED add  rsp,48h 
13FE81BF1 ret       

testSwitch:

13FE81C00 sub  rsp,48h 
13FE81C04 call qword ptr [__imp_clock (13FE81128h)] 
13FE81C0A mov  dword ptr [start],eax 
13FE81C0E mov  qword ptr [i],0 
13FE81C17 jmp  testSwitch+26h (13FE81C26h) 
13FE81C19 mov  rax,qword ptr [i] 
13FE81C1E inc  rax  
13FE81C21 mov  qword ptr [i],rax 
13FE81C26 cmp  qword ptr [i],20000000h 
13FE81C2F jae  testSwitch+0C5h (13FE81CC5h) 
13FE81C35 xor  edx,edx 
13FE81C37 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C3E mov  ecx,4 
13FE81C43 div  rax,rcx 
13FE81C46 mov  rax,rdx 
13FE81C49 inc  rax  
13FE81C4C mov  qword ptr [rsp+30h],rax 
13FE81C51 cmp  qword ptr [rsp+30h],1 
13FE81C57 je   testSwitch+73h (13FE81C73h) 
13FE81C59 cmp  qword ptr [rsp+30h],2 
13FE81C5F je   testSwitch+87h (13FE81C87h) 
13FE81C61 cmp  qword ptr [rsp+30h],3 
13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
13FE81C69 cmp  qword ptr [rsp+30h],4 
13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
13FE81C71 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C73 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C7A add  rax,4 
13FE81C7E mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C85 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C87 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C8E add  rax,3 
13FE81C92 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C99 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C9B mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CA2 add  rax,2 
13FE81CA6 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CAD jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81CAF mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CB6 inc  rax  
13FE81CB9 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CC0 jmp  testSwitch+19h (13FE81C19h) 
13FE81CC5 call qword ptr [__imp_clock (13FE81128h)] 
13FE81CCB sub  eax,dword ptr [start] 
13FE81CCF imul eax,eax,3E8h 
13FE81CD5 cdq       
13FE81CD6 mov  ecx,3E8h 
13FE81CDB idiv eax,ecx 
13FE81CDD cdqe      
13FE81CDF add  rsp,48h 
13FE81CE3 ret       

更新:

有趣的结果此处。但不确定为什么一个更快,一个更慢。

Is a switch statement actually faster than an if statement?

I ran the code below on Visual Studio 2010's x64 C++ compiler with the /Ox flag:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 29)
size_t counter = 0;

size_t testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        switch (counter % 4 + 1)
        {
            case 1: counter += 4; break;
            case 2: counter += 3; break;
            case 3: counter += 2; break;
            case 4: counter += 1; break;
        }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

size_t testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = counter % 4 + 1;
        if (c == 1) { counter += 4; }
        else if (c == 2) { counter += 3; }
        else if (c == 3) { counter += 2; }
        else if (c == 4) { counter += 1; }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    printf("Starting...\n");
    printf("Switch statement: %u ms\n", testSwitch());
    printf("If     statement: %u ms\n", testIf());
}

and got these results:

Switch statement: 5261 ms
If statement: 5196 ms

From what I've learned, switch statements apparently use jump tables to optimize the branching.

Questions:

  1. What would a basic jump table look like, in x86 or x64?

  2. Is this code using a jump table?

  3. Why is there no performance difference in this example? Is there any situation in which there is a significant performance difference?


Disassembly of the code:

testIf:

13FE81B10 sub  rsp,48h 
13FE81B14 call qword ptr [__imp_clock (13FE81128h)] 
13FE81B1A mov  dword ptr [start],eax 
13FE81B1E mov  qword ptr [i],0 
13FE81B27 jmp  testIf+26h (13FE81B36h) 
13FE81B29 mov  rax,qword ptr [i] 
13FE81B2E inc  rax  
13FE81B31 mov  qword ptr [i],rax 
13FE81B36 cmp  qword ptr [i],20000000h 
13FE81B3F jae  testIf+0C3h (13FE81BD3h) 
13FE81B45 xor  edx,edx 
13FE81B47 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B4E mov  ecx,4 
13FE81B53 div  rax,rcx 
13FE81B56 mov  rax,rdx 
13FE81B59 inc  rax  
13FE81B5C mov  qword ptr [c],rax 
13FE81B61 cmp  qword ptr [c],1 
13FE81B67 jne  testIf+6Dh (13FE81B7Dh) 
13FE81B69 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B70 add  rax,4 
13FE81B74 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B7B jmp  testIf+0BEh (13FE81BCEh) 
13FE81B7D cmp  qword ptr [c],2 
13FE81B83 jne  testIf+89h (13FE81B99h) 
13FE81B85 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B8C add  rax,3 
13FE81B90 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B97 jmp  testIf+0BEh (13FE81BCEh) 
13FE81B99 cmp  qword ptr [c],3 
13FE81B9F jne  testIf+0A5h (13FE81BB5h) 
13FE81BA1 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BA8 add  rax,2 
13FE81BAC mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BB3 jmp  testIf+0BEh (13FE81BCEh) 
13FE81BB5 cmp  qword ptr [c],4 
13FE81BBB jne  testIf+0BEh (13FE81BCEh) 
13FE81BBD mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BC4 inc  rax  
13FE81BC7 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BCE jmp  testIf+19h (13FE81B29h) 
13FE81BD3 call qword ptr [__imp_clock (13FE81128h)] 
13FE81BD9 sub  eax,dword ptr [start] 
13FE81BDD imul eax,eax,3E8h 
13FE81BE3 cdq       
13FE81BE4 mov  ecx,3E8h 
13FE81BE9 idiv eax,ecx 
13FE81BEB cdqe      
13FE81BED add  rsp,48h 
13FE81BF1 ret       

testSwitch:

13FE81C00 sub  rsp,48h 
13FE81C04 call qword ptr [__imp_clock (13FE81128h)] 
13FE81C0A mov  dword ptr [start],eax 
13FE81C0E mov  qword ptr [i],0 
13FE81C17 jmp  testSwitch+26h (13FE81C26h) 
13FE81C19 mov  rax,qword ptr [i] 
13FE81C1E inc  rax  
13FE81C21 mov  qword ptr [i],rax 
13FE81C26 cmp  qword ptr [i],20000000h 
13FE81C2F jae  testSwitch+0C5h (13FE81CC5h) 
13FE81C35 xor  edx,edx 
13FE81C37 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C3E mov  ecx,4 
13FE81C43 div  rax,rcx 
13FE81C46 mov  rax,rdx 
13FE81C49 inc  rax  
13FE81C4C mov  qword ptr [rsp+30h],rax 
13FE81C51 cmp  qword ptr [rsp+30h],1 
13FE81C57 je   testSwitch+73h (13FE81C73h) 
13FE81C59 cmp  qword ptr [rsp+30h],2 
13FE81C5F je   testSwitch+87h (13FE81C87h) 
13FE81C61 cmp  qword ptr [rsp+30h],3 
13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
13FE81C69 cmp  qword ptr [rsp+30h],4 
13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
13FE81C71 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C73 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C7A add  rax,4 
13FE81C7E mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C85 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C87 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C8E add  rax,3 
13FE81C92 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C99 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C9B mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CA2 add  rax,2 
13FE81CA6 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CAD jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81CAF mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CB6 inc  rax  
13FE81CB9 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CC0 jmp  testSwitch+19h (13FE81C19h) 
13FE81CC5 call qword ptr [__imp_clock (13FE81128h)] 
13FE81CCB sub  eax,dword ptr [start] 
13FE81CCF imul eax,eax,3E8h 
13FE81CD5 cdq       
13FE81CD6 mov  ecx,3E8h 
13FE81CDB idiv eax,ecx 
13FE81CDD cdqe      
13FE81CDF add  rsp,48h 
13FE81CE3 ret       

Update:

Interesting results here. Not sure why one is faster and one is slower, though.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

还如梦归 2024-12-02 22:30:37

编译器可以对交换机进行多种优化。我不认为经常提到的“跳转表”是非常有用的,因为它仅在输入可以以某种方式限制时才有效。

“跳转表”的 C 伪代码类似于 this - 请注意,编译器在实践中需要在表周围插入某种形式的 if 测试,以确保表中的输入有效。另请注意,它仅适用于输入是一系列连续数字的特定情况。

如果开关中的分支数量非常大,编译器可以执行诸如对开关的值使用二分搜索之类的操作,这(在我看来)将是一种更有用的优化,因为它确实显着提高了某些功能的性能场景,与 switch 一样通用,并且不会导致生成的代码大小更大。但要看到这一点,您的测试代码将需要更多分支才能看到任何差异。

要回答您的具体问题:

  1. Clang 生成一个看起来像 这个

    test_switch(char): # @test_switch(char)
            movl %edi, %eax
            cmpl $19, %edi
            jbe.LBB0_1
            请求
    .LBB0_1:
            jmpq *.LJTI0_0(,%rax,8)
            jmp void call<0u>() # TAILCALL
            jmp void call<1u>() # TAILCALL
            jmp void call<2u>() # TAILCALL
            jmp void call<3u>() # TAILCALL
            jmp void call<4u>() # TAILCALL
            jmp void call<5u>() # TAILCALL
            jmp void call<6u>() # TAILCALL
            jmp void call<7u>() # TAILCALL
            jmp void call<8u>() # TAILCALL
            jmp void call<9u>() # TAILCALL
            jmp void call<10u>() # TAILCALL
            jmp void call<11u>() # TAILCALL
            jmp void call<12u>() # TAILCALL
            jmp void call<13u>() # TAILCALL
            jmp void call<14u>() # TAILCALL
            jmp void call<15u>() # TAILCALL
            jmp void call<16u>() # TAILCALL
            jmp void call<17u>() # TAILCALL
            jmp void call<18u>() # TAILCALL
            jmp void call<19u>() # TAILCALL
    .LJTI0_0:
            .quad .LBB0_2
            .quad .LBB0_3
            .quad .LBB0_4
            .quad .LBB0_5
            .quad .LBB0_6
            .quad .LBB0_7
            .quad .LBB0_8
            .quad .LBB0_9
            .quad .LBB0_10
            .quad .LBB0_11
            .quad .LBB0_12
            .quad .LBB0_13
            .quad .LBB0_14
            .quad .LBB0_15
            .quad .LBB0_16
            .quad .LBB0_17
            .quad .LBB0_18
            .quad .LBB0_19
            .quad .LBB0_20
            .quad .LBB0_21
    
  2. 我可以说它没有使用跳转表——4条比较指令清晰可见:

    <前><代码>13FE81C51 cmp qword ptr [rsp+30h],1
    13FE81C57 je 测试开关+73h (13FE81C73h)
    13FE81C59 cmp qword ptr [rsp+30h],2
    13FE81C5F je 测试开关+87h (13FE81C87h)
    13FE81C61 cmp qword ptr [rsp+30h],3
    13FE81C67 测试开关+9Bh (13FE81C9Bh)
    13FE81C69 cmp qword ptr [rsp+30h],4
    13FE81C6F je 测试开关+0AFh (13FE81CAFh)

    基于跳转表的解决方案根本不使用比较。

  3. 要么没有足够的分支导致编译器生成跳转表,要么您的编译器根本不生成它们。我不确定是哪一个。

编辑 2014:熟悉 LLVM 优化器的人在其他地方进行了一些讨论,表示跳转表优化在许多情况下都很重要;例如,在存在具有许多值的枚举并且许多情况与所述枚举中的值相反的情况下。也就是说,我坚持我在 2011 年所说的话——我经常看到人们在想“如果我做出改变,无论我有多少病例,时间都是一样的”——这是完全错误的。即使使用跳转表,您也会获得间接跳转成本,并为每种情况的表中的条目付费;内存带宽对于现代硬件来说是一个大问题。

编写代码以提高可读性。 <一href="http://gcc.godbolt.org/#%7B%22version%22%3A3%2C%22filterAsm%22%3A%7B%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22co mmentOnly%22%3Atrue%7D%2C%22compilers%22%3A%5B%7B%22sourcez%22%3A%22C4UwtgDgNghqA8BXAdgZwJYHNkgCYD4AoANwHt1cACAYxiigAoBKAbkJPKt FWAH1UA7umDUAFgzEwATjSaUA3oQCQg4WMoS5ipbVQhKABgBcynXSjwD%2BZmyVKARlJAwA1rd36AjCbu168T2tWU0dnN2UPSgAmHzN%2FKKDbBydXdxg9SgBmWL8LT MSQlPCzDIAWHPN4UoK7UNSI9P0AVgr%2FJprksLSMgDZWix6OuuLIgHZ%2B%2BFGhou79AA4JuemuhoyATgm15fqSr2NTXICrG0KV3cpPbwPKy%2B2RxouY6%2F9PBJP amdWvbOeLT3z3p0dpFPOVfgFqoDhrMLi1wZ52lDPudPH14YMkWcQeN4VNMcCHp4FvClvj7hlPBt4VsybYAL6EBkcCiUbh8dAAMwkomksgUyk5GmolAAvCLDFpwcdgko 6ZQQFAMoKJKLxZ5Jb4bjU5QqlRyharohq4hY3jKdYr9MrhWKssbDgDzfLLZRrYbSvbKpCnbqrfqVbamp62trnXqDbaesGBqHfa7%2FTbxaNo5NYy63ba5inST70wnDW sUzTc%2BGA2qDCnPNL6WG%2FRG1er%2BZqXoF3hbS4nHpWzTW4xm1ZlK47e3n6xCPU2TRC0x3DQjK4iS3Wyxco5PDqiZ8vO55k%2Bubnil%2FGx0TKzmR7PbZTK8X6Y zCEAAA%3D%22%2C%22compiler%22%3A%22clang371%22%2C%22options%22%3A%22%20-std%3Dc%2B%2B14%20-O3%20-Wall%20-Wextra% 22%7D%5D%7D">任意值得称道的编译器会看到一个 if / else if 梯形图,并将其转换为等效的开关,反之亦然,如果这样做会更快的话。

There are several optimizations a compiler can make on a switch. I don't think the oft-mentioned "jump-table" is a very useful one though, as it only works when the input can be bounded some way.

C Pseudocode for a "jump table" would be something like this -- note that the compiler in practice would need to insert some form of if test around the table to ensure that the input was valid in the table. Note also that it only works in the specific case that the input is a run of consecutive numbers.

If the number of branches in a switch is extremely large, a compiler can do things like using binary search on the values of the switch, which (in my mind) would be a much more useful optimization, as it does significantly increase performance in some scenarios, is as general as a switch is, and does not result in greater generated code size. But to see that, your test code would need a LOT more branches to see any difference.

To answer your specific questions:

  1. Clang generates one that looks like this:

    test_switch(char):                       # @test_switch(char)
            movl    %edi, %eax
            cmpl    $19, %edi
            jbe     .LBB0_1
            retq
    .LBB0_1:
            jmpq    *.LJTI0_0(,%rax,8)
            jmp     void call<0u>()         # TAILCALL
            jmp     void call<1u>()         # TAILCALL
            jmp     void call<2u>()         # TAILCALL
            jmp     void call<3u>()         # TAILCALL
            jmp     void call<4u>()         # TAILCALL
            jmp     void call<5u>()         # TAILCALL
            jmp     void call<6u>()         # TAILCALL
            jmp     void call<7u>()         # TAILCALL
            jmp     void call<8u>()         # TAILCALL
            jmp     void call<9u>()         # TAILCALL
            jmp     void call<10u>()        # TAILCALL
            jmp     void call<11u>()        # TAILCALL
            jmp     void call<12u>()        # TAILCALL
            jmp     void call<13u>()        # TAILCALL
            jmp     void call<14u>()        # TAILCALL
            jmp     void call<15u>()        # TAILCALL
            jmp     void call<16u>()        # TAILCALL
            jmp     void call<17u>()        # TAILCALL
            jmp     void call<18u>()        # TAILCALL
            jmp     void call<19u>()        # TAILCALL
    .LJTI0_0:
            .quad   .LBB0_2
            .quad   .LBB0_3
            .quad   .LBB0_4
            .quad   .LBB0_5
            .quad   .LBB0_6
            .quad   .LBB0_7
            .quad   .LBB0_8
            .quad   .LBB0_9
            .quad   .LBB0_10
            .quad   .LBB0_11
            .quad   .LBB0_12
            .quad   .LBB0_13
            .quad   .LBB0_14
            .quad   .LBB0_15
            .quad   .LBB0_16
            .quad   .LBB0_17
            .quad   .LBB0_18
            .quad   .LBB0_19
            .quad   .LBB0_20
            .quad   .LBB0_21
    
  2. I can say that it is not using a jump table -- 4 comparison instructions are clearly visible:

    13FE81C51 cmp  qword ptr [rsp+30h],1 
    13FE81C57 je   testSwitch+73h (13FE81C73h) 
    13FE81C59 cmp  qword ptr [rsp+30h],2 
    13FE81C5F je   testSwitch+87h (13FE81C87h) 
    13FE81C61 cmp  qword ptr [rsp+30h],3 
    13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
    13FE81C69 cmp  qword ptr [rsp+30h],4 
    13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
    

    A jump table based solution does not use comparison at all.

  3. Either not enough branches to cause the compiler to generate a jump table, or your compiler simply doesn't generate them. I'm not sure which.

EDIT 2014: There has been some discussion elsewhere from people familiar with the LLVM optimizer saying that the jump table optimization can be important in many scenarios; e.g. in cases where there is an enumeration with many values and many cases against values in said enumeration. That said, I stand by what I said above in 2011 -- too often I see people thinking "if I make it a switch, it'll be the same time no matter how many cases I have" -- and that's completely false. Even with a jump table you get the indirect jump cost and you pay for entries in the table for each case; and memory bandwidth is a Big Deal on modern hardware.

Write code for readability. Any compiler worth its salt is going to see an if / else if ladder and transform it into equivalent switch or vice versa if it would be faster to do so.

So要识趣 2024-12-02 22:30:37

对于您的问题:

1.在 x86 或 x64 中,基本跳转表是什么样的?

跳转表是内存地址,它保存指向数组结构之类的标签的指针。以下示例将帮助您了解跳转表的布局方式

00B14538  D8 09 AB 00 D8 09 AB 00 D8 09 AB 00 D8 09 AB 00  Ø.«.Ø.«.Ø.«.Ø.«.
00B14548  D8 09 AB 00 D8 09 AB 00 D8 09 AB 00 00 00 00 00  Ø.«.Ø.«.Ø.«.....
00B14558  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00B14568  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

在此处输入图像描述

其中 00B14538 是指向跳转表的指针,值如 D8 09 AB 00代表标签指针。

2.这段代码是否使用了跳转表?
在这种情况下不行。

3.为什么此示例中没有性能差异?

没有性能差异,因为两种情况的指令看起来相同,没有跳转表。

4.是否存在显着性能差异

的情况?如果您有很长的if检查序列,在这种情况下使用跳转表可以提高性能(分支) /jmp 指令昂贵(如果它们不能近乎完美地预测的话),但会带来内存成本。

所有比较指令的代码也有一定的大小,因此特别是对于 32 位指针或偏移量,单个跳转表查找可能不会在可执行文件中花费更多的大小。

结论:编译器足够聪明,可以处理这种情况并生成适当的指令:)

To your question:

1.What would a basic jump table look like, in x86 or x64?

Jump table is memory address that holds pointer to the labels in something like array structure. following example will help you understand how jump tables are laid out

00B14538  D8 09 AB 00 D8 09 AB 00 D8 09 AB 00 D8 09 AB 00  Ø.«.Ø.«.Ø.«.Ø.«.
00B14548  D8 09 AB 00 D8 09 AB 00 D8 09 AB 00 00 00 00 00  Ø.«.Ø.«.Ø.«.....
00B14558  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00B14568  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

enter image description here

Where 00B14538 is the pointer to the Jump table , and value like D8 09 AB 00 represents label pointer.

2.Is this code using a jump table?
No in this case.

3.Why is there no performance difference in this example?

There is no performance difference because instruction for both case looks same, no jump table.

4.Is there any situation in which there is a significant performance difference?

If you have very long sequence of if check, in that case using a jump table improves performance (branching/jmp instructions are expensive if they don't predict near-perfectly) but comes with the cost of memory.

The code for all the compare instructions has some size, too, so especially with 32-bit pointers or offsets, a single jump table lookup might not cost a lot more size in an executable.

Conclusion: Compiler is smart enough handle such case and generate appropriate instructions :)

紫竹語嫣☆ 2024-12-02 22:30:37

编译器可以自由地将 switch 语句编译为相当于 if 语句的代码,或者创建跳转表。它可能会根据执行速度最快或生成最小的代码来选择其中之一,这在某种程度上取决于您在编译器选项中指定的内容 - 所以最坏的情况它将与

我信任编译器的 if 语句的速度相同做出最佳选择并专注于使代码最具可读性的内容。

如果情况数量变得非常大,跳转表将比一系列 if 快得多。然而,如果值之间的步长非常大,则跳转表可能会变得很大,并且编译器可能会选择不生成跳转表。

The compiler is free to compile the switch statement as a code which is equivalent to if-statement, or to create a jump table. It will likely chose one on the other based on what will execute fastest or generate the smallest code somewhat depending on what you have specified in you compiler options -- so worst case it will be the same speed as if-statements

I would trust the compiler to do the best choice and focus on what makes the code most readable.

If the number of cases becomes very large a jump table will be much faster than a series of if. However if the steps between the values is very large, then the jump table can become large, and the compiler may choose not to generate one.

久夏青 2024-12-02 22:30:37

您如何知道您的计算机在 switch 测试循环期间没有执行一些与测试无关的任务,并且在 if 测试循环期间执行较少的任务?您的测试结果没有显示任何内容,因为:

  1. 差异非常小,
  2. 只有一个结果,而不是一系列结果,
  3. 情况太少

我的结果:

添加了:

printf("counter: %u\n", counter);

我在最后 这样它就不会优化循环作为计数器在您的示例中从未使用过,那么编译器为什么要执行循环?即使使用这样的微基准,交换机也总是会立即获胜。

代码的另一个问题是:

switch (counter % 4 + 1)

在 switch 循环中,而不是

const size_t c = counter % 4 + 1; 

在 if 循环中。如果你解决这个问题,差别很大。我相信将语句放入 switch 语句中会促使编译器将值直接发送到 CPU 寄存器中,而不是先将其放入堆栈中。因此,这有利于 switch 语句,而不是平衡测试。

哦,我认为你还应该在测试之间重置计数器。事实上,您可能应该使用某种随机数而不是 +1、+2、+3 等,因为它可能会优化那里的某些内容。例如,随机数是指基于当前时间的数字。否则,编译器可能会将您的两个函数都转换为一个很长的数学运算,甚至不会打扰任何循环。

我已经修改了 Ryan 的代码,足以确保编译器在代码运行之前无法弄清楚事情:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 26)
size_t counter = 0;

long long testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = rand() % 20 + 1;

        switch (c)
        {
                case 1: counter += 20; break;
                case 2: counter += 33; break;
                case 3: counter += 62; break;
                case 4: counter += 15; break;
                case 5: counter += 416; break;
                case 6: counter += 3545; break;
                case 7: counter += 23; break;
                case 8: counter += 81; break;
                case 9: counter += 256; break;
                case 10: counter += 15865; break;
                case 11: counter += 3234; break;
                case 12: counter += 22345; break;
                case 13: counter += 1242; break;
                case 14: counter += 12341; break;
                case 15: counter += 41; break;
                case 16: counter += 34321; break;
                case 17: counter += 232; break;
                case 18: counter += 144231; break;
                case 19: counter += 32; break;
                case 20: counter += 1231; break;
        }
    }
    return 1000 * (long long)(clock() - start) / CLOCKS_PER_SEC;
}

long long testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = rand() % 20 + 1;
        if (c == 1) { counter += 20; }
        else if (c == 2) { counter += 33; }
        else if (c == 3) { counter += 62; }
        else if (c == 4) { counter += 15; }
        else if (c == 5) { counter += 416; }
        else if (c == 6) { counter += 3545; }
        else if (c == 7) { counter += 23; }
        else if (c == 8) { counter += 81; }
        else if (c == 9) { counter += 256; }
        else if (c == 10) { counter += 15865; }
        else if (c == 11) { counter += 3234; }
        else if (c == 12) { counter += 22345; }
        else if (c == 13) { counter += 1242; }
        else if (c == 14) { counter += 12341; }
        else if (c == 15) { counter += 41; }
        else if (c == 16) { counter += 34321; }
        else if (c == 17) { counter += 232; }
        else if (c == 18) { counter += 144231; }
        else if (c == 19) { counter += 32; }
        else if (c == 20) { counter += 1231; }
    }
    return 1000 * (long long)(clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    srand(time(NULL));
    printf("Starting...\n");
    printf("Switch statement: %lld ms\n", testSwitch()); fflush(stdout);
    printf("counter: %d\n", counter);
    counter = 0;
    srand(time(NULL));
    printf("If     statement: %lld ms\n", testIf()); fflush(stdout);
    printf("counter: %d\n", counter);
} 

switch: 3740
if: 3980

(多次尝试的类似结果)

我还将 case/if 的数量减少到 5,并且 switch 函数仍然获胜。

How do you know your computer was not performing some task unrelated to the test during the switch test loop and performing fewer tasks during the if test loop? Your test results do not show anything as:

  1. the difference is very small
  2. there is only one result, not a series of results
  3. there are too few cases

My results:

I addded:

printf("counter: %u\n", counter);

to the end so that it would not optimise away the loop as counter was never used in your example so why would the compiler perform the loop? Immediately, the switch was always winning even with such a micro-benchmark.

The other problem with your code is:

switch (counter % 4 + 1)

in your switch loop, versus

const size_t c = counter % 4 + 1; 

in your if loop. Very big difference if you fix that. I believe that putting the statement inside the switch statement provokes the compiler into sending the value directly into the CPU registers rather than putting it on the stack first. This is therefore in favour of the switch statement and not a balanced test.

Oh and I think you should also reset counter between tests. In fact, you probably should be using some kind of random number instead of +1, +2, +3 etc, as it will probably optimise something there. By random number, I mean a number based on the current time, for example. Otherwise, the compiler could turn both of your functions into one long math operation and not even bother with any loops.

I have modified Ryan's code just enough to make sure the compiler couldn't figure things out before the code had run:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 26)
size_t counter = 0;

long long testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = rand() % 20 + 1;

        switch (c)
        {
                case 1: counter += 20; break;
                case 2: counter += 33; break;
                case 3: counter += 62; break;
                case 4: counter += 15; break;
                case 5: counter += 416; break;
                case 6: counter += 3545; break;
                case 7: counter += 23; break;
                case 8: counter += 81; break;
                case 9: counter += 256; break;
                case 10: counter += 15865; break;
                case 11: counter += 3234; break;
                case 12: counter += 22345; break;
                case 13: counter += 1242; break;
                case 14: counter += 12341; break;
                case 15: counter += 41; break;
                case 16: counter += 34321; break;
                case 17: counter += 232; break;
                case 18: counter += 144231; break;
                case 19: counter += 32; break;
                case 20: counter += 1231; break;
        }
    }
    return 1000 * (long long)(clock() - start) / CLOCKS_PER_SEC;
}

long long testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = rand() % 20 + 1;
        if (c == 1) { counter += 20; }
        else if (c == 2) { counter += 33; }
        else if (c == 3) { counter += 62; }
        else if (c == 4) { counter += 15; }
        else if (c == 5) { counter += 416; }
        else if (c == 6) { counter += 3545; }
        else if (c == 7) { counter += 23; }
        else if (c == 8) { counter += 81; }
        else if (c == 9) { counter += 256; }
        else if (c == 10) { counter += 15865; }
        else if (c == 11) { counter += 3234; }
        else if (c == 12) { counter += 22345; }
        else if (c == 13) { counter += 1242; }
        else if (c == 14) { counter += 12341; }
        else if (c == 15) { counter += 41; }
        else if (c == 16) { counter += 34321; }
        else if (c == 17) { counter += 232; }
        else if (c == 18) { counter += 144231; }
        else if (c == 19) { counter += 32; }
        else if (c == 20) { counter += 1231; }
    }
    return 1000 * (long long)(clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    srand(time(NULL));
    printf("Starting...\n");
    printf("Switch statement: %lld ms\n", testSwitch()); fflush(stdout);
    printf("counter: %d\n", counter);
    counter = 0;
    srand(time(NULL));
    printf("If     statement: %lld ms\n", testIf()); fflush(stdout);
    printf("counter: %d\n", counter);
} 

switch: 3740
if: 3980

(similar results over multiple attempts)

I also reduced the number of cases/ifs to 5 and the switch function still won.

鹿港巷口少年归 2024-12-02 22:30:37

以下是旧的(现在很难找到)bench++ 基准测试的一些结果:

Test Name:   F000003                         Class Name:  Style
CPU Time:       0.781  nanoseconds           plus or minus     0.0715
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way if/else if statement
 compare this test with F000004

Test Name:   F000004                         Class Name:  Style
CPU Time:        1.53  nanoseconds           plus or minus     0.0767
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way switch statement
 compare this test with F000003

Test Name:   F000005                         Class Name:  Style
CPU Time:        7.70  nanoseconds           plus or minus      0.385
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way if/else if statement
 compare this test with F000006

Test Name:   F000006                         Class Name:  Style
CPU Time:        2.00  nanoseconds           plus or minus     0.0999
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way switch statement
 compare this test with F000005

Test Name:   F000007                         Class Name:  Style
CPU Time:        3.41  nanoseconds           plus or minus      0.171
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way sparse switch statement
 compare this test with F000005 and F000006

我们可以从中看到(在这台机器上,使用这个编译器 - VC++ 9.0 x64),每个 if 测试需要大约0.7纳秒。随着测试数量的增加,时间几乎完全呈线性增长。

使用 switch 语句,只要值密集,2 路测试和 10 路测试之间的速度几乎没有差异。使用稀疏值的 10 路测试花费的时间大约是使用密集值的 10 路测试的 1.6 倍 - 但即使使用稀疏值,仍然比 10 路 if/否则如果

底线:仅使用 4 路测试并不能真正向您展示太多关于 switchif/else 的性能。如果您查看此代码中的数字,很容易插入这样一个事实:对于 4 路测试,我们预计两者会产生相当相似的结果(约 2.8 纳秒) code>if/elseswitch 的 ~2.0)。

Here are some results from the old (now hard to find) bench++ benchmark:

Test Name:   F000003                         Class Name:  Style
CPU Time:       0.781  nanoseconds           plus or minus     0.0715
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way if/else if statement
 compare this test with F000004

Test Name:   F000004                         Class Name:  Style
CPU Time:        1.53  nanoseconds           plus or minus     0.0767
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 2-way switch statement
 compare this test with F000003

Test Name:   F000005                         Class Name:  Style
CPU Time:        7.70  nanoseconds           plus or minus      0.385
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way if/else if statement
 compare this test with F000006

Test Name:   F000006                         Class Name:  Style
CPU Time:        2.00  nanoseconds           plus or minus     0.0999
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way switch statement
 compare this test with F000005

Test Name:   F000007                         Class Name:  Style
CPU Time:        3.41  nanoseconds           plus or minus      0.171
Wall/CPU:        1.00  ratio.                Iteration Count:  1677721600
Test Description:
 Time to test a global using a 10-way sparse switch statement
 compare this test with F000005 and F000006

What we can see from this is that (on this machine, with this compiler -- VC++ 9.0 x64), each if test takes about 0.7 nanoseconds. As the number of tests goes up, the time scales almost perfectly linearly.

With the switch statement, there's almost no difference in speed between a 2-way and a 10-way test, as long as the values are dense. The 10-way test with sparse values takes about 1.6x as much time as the 10-way test with dense values -- but even with sparse values, still better than twice the speed of a 10-way if/else if.

Bottom line: using only a 4-way test won't really show you much about the performance of switch vs if/else. If you look at the numbers from this code, it's pretty easy to interpolate the fact that for a 4-way test, we'd expect the two to produce pretty similar results (~2.8 nanoseconds for an if/else, ~2.0 for switch).

孤凫 2024-12-02 22:30:37

一个好的优化编译器(例如 MSVC)可以生成:

  1. 如果案例排列在一个很好的长范围内,则生成一个简单的跳转表;
  2. 如果有很多间隙,则生成一个稀疏(两级)跳转表;
  3. 如果案例数量较少,则生成一系列 if 语句。值为
    不会紧密结合在一起
  4. 如果案例代表几组,则上述组合
    紧密间隔的范围。

简而言之,如果 switch 看起来比一系列 if 慢,编译器可能会将其转换为一个。它可能不仅仅是每种情况的比较序列,而是二叉搜索树。有关示例,请参阅此处

A good optimizing compiler such as MSVC can generate:

  1. a simple jump table if the cases are arranged in a nice long range
  2. a sparse (two-level) jump table if there are many gaps
  3. a series of ifs if the number of cases is small or the values are
    not close together
  4. a combination of above if the cases represent several groups of
    closely-spaced ranges.

In short, if the switch looks to be slower than a series of ifs, the compiler might just convert it to one. And it's likely to be not just a sequence of comparisons for each case, but a binary search tree. See here for an example.

傲性难收 2024-12-02 22:30:37

我将回答 2) 并发表一些一般性评论。 2)不,您发布的汇编代码中没有跳转表。跳转表是跳转目的地的表,以及从表中直接跳转到索引位置的一两个指令。当有许多可能的切换目标时,跳转表会更有意义。也许优化器知道简单的 if else 逻辑会更快,除非目的地数量大于某个阈值。再次尝试您的示例,假设有 20 种可能性而不是 4 种。

I'll answer 2) and make some general comments. 2) No, there is no jump table in the assembly code you've posted. A jump table is a table of jump destinations, and one or two instructions to jump directly to an indexed location from the table. A jump table would make more sense when there are many possible switch destinations. Maybe the optimiser knows that simple if else logic is faster unless the number of destinations is greater than some threshold. Try your example again with say 20 possibilities instead of 4.

成熟稳重的好男人 2024-12-02 22:30:37

我很感兴趣,并查看了我可以对您的示例进行哪些更改,以使其更快地运行 switch 语句。

如果达到 40 个 if 语句,并添加 0 个 case,则 if 块的运行速度将比等效的 switch 语句慢。我这里有结果:https://www.ideone.com/KZeCz

删除 0 案例的效果可以在这里看到:https://www.ideone.com/LFnrX

I was intrigued, and took a look at what I could change about your example to get it to run the switch statement faster.

If you get to 40 if statements, and add a 0 case, then the if block will run slower than the equivalent switch statement. I have the results here: https://www.ideone.com/KZeCz.

The effect of removing the 0 case can be seen here: https://www.ideone.com/LFnrX.

吻风 2024-12-02 22:30:37

请注意,当开关未编译为跳转表时,您通常可以编写 if 比开关更有效...

(1) 如果情况有顺序,而不是对所有 N 进行最坏情况测试,您可以编写你的 if 是测试是否在上半部分或下半部分,然后在每一半部分,二分搜索风格...

如果某些情况/组比其他情况更频繁,则最坏的情况是 logN 而不是 N (2) ,然后设计你的 if 来首先隔离这些情况可以加快速度平均通过时间

Note that when a switch is NOT compiled to a jump table, you can very often write if's more efficiently than the switch...

(1) if the cases have an ordering, rather than the worst case testing for all N, you can write your if's to test if in the upper or lower half, then in each half of that, binary search style... resulting in the worst case being logN rather than N

(2) if certain cases/groups are far more frequent than other cases, then designing your if's to isolate those cases first can speed up the average time through

执笔绘流年 2024-12-02 22:30:37

没有任何。在大多数特定情况下,当您进入汇编器并对性能进行实际测量时,您的问题根本就是错误的。对于给定的示例,您的想法绝对太短了,因为

counter += (4 - counter % 4);

在我看来,您应该使用正确的增量表达式。

None. In most particular cases where you go into the assembler and do real measurements of performance your question is simply the wrong one. For the given example your thinking goes definitively too short since

counter += (4 - counter % 4);

looks to me to be the correct increment expression that you should be using.

微凉徒眸意 2024-12-02 22:30:37

不,这些是如果然后跳转其他如果然后跳转其他...跳转表将有一个地址表或使用哈希或类似的东西。

更快或更慢是主观的。例如,您可以将情况 1 作为最后一件事,而不是首先进行,如果您的测试程序或现实世界程序大部分时间都使用情况 1,则此实现的代码会变慢。因此,只需根据实施情况重新排列案例列表即可产生很大的不同。

如果您使用案例 0-3 而不是 1-4,编译器可能使用了跳转表,编译器应该已经想出删除您的 +1 的方法。也许是物品数量太少的缘故。例如,如果您将其设置为 0 - 15 或 0 - 31,它可能会使用表格或使用其他快捷方式来实现它。编译器可以自由选择如何实现事物,只要它满足源代码的功能即可。这涉及到编译器差异、版本差异和优化差异。如果你想要一个跳转表,就创建一个跳转表,如果你想要一个 if-then-else 树,就创建一个 if-then-else 树。如果您希望编译器做出决定,请使用 switch/case 语句。

No these are if then jump else if then jump else...A jump table would have a table of addresses or use a hash or something like that.

Faster or slower is subjective. You could for example have case 1 be the last thing instead of first and if your test program or real world program used case 1 most of the time the code would be slower with this implementation. So just re-arranging the case list, depending on the implementation, can make a big difference.

If you had used cases 0-3 instead of 1-4 the compiler might have used a jump table, the compiler should have figured out removing your +1 anyway. Perhaps it was the small number of items. Had you made it 0 - 15 or 0 - 31 for example it may have implemented it with a table or used some other shortcut. The compiler is free to choose how it implements things so long as it meets the functionality of the source code. And this gets into compiler differences and version differences and optimization differences. If you want a jump table, make a jump table, if you want an if-then-else tree make an if-then-else tree. If you want the compiler to decide, use a switch/case statement.

灯下孤影 2024-12-02 22:30:37

但不确定为什么一个更快,一个更慢。

这实际上并不太难解释......如果您还记得错误预测的分支比正确预测的分支贵数十到数百倍。

% 20 版本中,第一个 case/if 始终是命中的那个。现代CPU“学习”通常采用哪些分支,不采用哪些分支,因此它们可以轻松预测该分支在循环的几乎每次迭代中的行为方式。这就解释了为什么“如果”版本会成功;它永远不必执行第一个测试之后的任何内容,并且它(正确)预测大多数迭代的测试结果。显然,“开关”的实现方式略有不同——甚至可能是一个跳转表,由于计算分支,它可能会很慢。

%21版本中,分支本质上是随机的。因此,它们中的许多不仅每次迭代都会执行,CPU 也无法猜测它们会走哪条路。在这种情况下,跳转表(或其他“开关”优化)可能会有所帮助。

预测一段代码在现代编译器和 CPU 下的执行情况非常困难,而且每一代都变得更加困难。最好的建议是“根本不用费心去尝试;总是进行侧写”。这些建议每年都会变得更好,而能够成功忽略它的人也越来越少。

所有这些都表明我上面的解释很大程度上是一种猜测。 :-)

Not sure why one is faster and one is slower, though.

That is actually not too hard to explain... If you remember that mispredicted branches are tens to hundreds of times more expensive than correctly predicted branches.

In the % 20 version, the first case/if is always the one that hits. Modern CPUs "learn" which branches are usually taken and which are not, so they can easily predict how this branch will behave on almost every iteration of the loop. That explains why the "if" version flies; it never has to execute anything past the first test, and it (correctly) predicts the result of that test for most of the iterations. Obviously the "switch" is implemented slightly differently -- perhaps even a jump table, which can be slow thanks to the computed branch.

In the % 21 version, the branches are essentially random. So not only do many of them execute every iteration, the CPU cannot guess which way they will go. This is the case where a jump table (or other "switch" optimization) is likely to help.

It is very hard to predict how a piece of code is going to perform with a modern compiler and CPU, and it gets harder with every generation. The best advice is "don't even bother trying; always profile". That advice gets better -- and the set of people who can ignore it successfully gets smaller -- every year.

All of which is to say that my explanation above is largely a guess. :-)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文