Benchmark erroneous, assembly puzzling

Assembly newbie here. I wrote a benchmark to measure the floating-point performance of my machine while computing a transposed matrix-tensor product.

Given that my machine has 32 GiB of RAM (bandwidth ~37 GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (turbo 4.0 GHz), I estimated the maximum performance (with pipelining and with data held in registers) to be 6 cores x 4.0 GHz = 24 GFLOP/s. However, when I run the benchmark, I measure 127 GFLOP/s, which is clearly an erroneous measurement.

Note: to measure FP performance, I count the number of operations as n*n*n*n*6 (n^3 for the matrix-matrix multiplication, performed on n slices of complex data points, i.e. assuming one complex-complex multiplication takes 6 FLOPs) and divide it by the average duration of a run.
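For reference, here is a minimal sketch of that calculation (the helper measured_gflops is hypothetical, added for illustration only; it assumes avg_dur ends up in microseconds, as produced by do_timed_run below):

#include <cstddef>

// Hypothetical helper (not part of the original code): converts the operation
// count n*n*n*n*6 and an average duration in microseconds into GFLOP/s.
double measured_gflops(std::size_t n, double avg_dur_us)
{
    const double flop_count = 6.0 * n * n * n * n; // n^4 complex mults, 6 FLOPs each
    const double seconds = avg_dur_us * 1e-6;
    return flop_count / seconds / 1e9;
}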
Code snippet from the main function:
// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);
Code snippet of do_timed_run:
void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);

    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }

    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;

    avg_dur += dur2 - dur1;
}
Notes:
- At this point I am not providing the code of the function transp_matrix_tensor_mult, because I believe it is not relevant.
- #pragma noinline is a debugging device I use to better understand the disassembler's output.

Now the disassembly of the function do_timed_run:
0000000000403a20 <_Z12do_timed_runRKmRd>:
403a20: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403a27: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403a2e: 00
403a2f: 48 89 fd mov %rdi,%rbp
403a32: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403a39: 00
403a3a: 48 89 f3 mov %rsi,%rbx
403a3d: 48 89 ee mov %rbp,%rsi
403a40: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403a45: 48 89 ea mov %rbp,%rdx
403a48: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403a4f: 00
403a50: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403a57: 00
403a58: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403a5f: 00
403a60: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403a67: 00
403a68: e8 03 f8 ff ff callq 403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
403a6d: 48 89 ee mov %rbp,%rsi
403a70: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403a75: 48 89 ea mov %rbp,%rdx
403a78: 48 89 e9 mov %rbp,%rcx
403a7b: e8 80 f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a80: 48 89 ee mov %rbp,%rsi
403a83: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403a88: 48 89 ea mov %rbp,%rdx
403a8b: 48 89 e9 mov %rbp,%rcx
403a8e: e8 6d f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a93: bf 88 f3 44 00 mov $0x44f388,%edi
403a98: e8 53 f7 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403a9d: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403aa4: bf c0 f3 44 00 mov $0x44f3c0,%edi
403aa9: 33 c0 xor %eax,%eax
403aab: e8 20 f6 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403ab0: 85 c0 test %eax,%eax
403ab2: 74 21 je 403ad5 <_Z12do_timed_runRKmRd+0xb5>
403ab4: ba a5 3c 40 00 mov $0x403ca5,%edx
403ab9: bf c0 f3 44 00 mov $0x44f3c0,%edi
403abe: be 02 00 00 00 mov $0x2,%esi
403ac3: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403ac8: 33 c0 xor %eax,%eax
403aca: 4c 8d 41 38 lea 0x38(%rcx),%r8
403ace: e8 cd f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403ad3: eb 41 jmp 403b16 <_Z12do_timed_runRKmRd+0xf6>
403ad5: bf c0 f3 44 00 mov $0x44f3c0,%edi
403ada: 33 c0 xor %eax,%eax
403adc: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403ae3: e8 58 f7 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403ae8: be 9c 13 47 00 mov $0x47139c,%esi
403aed: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403af4: 00
403af5: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403afa: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403afe: e8 a2 01 00 00 callq 403ca5 <_Z12do_timed_runRKmRd+0x285>
403b03: bf c0 f3 44 00 mov $0x44f3c0,%edi
403b08: 33 c0 xor %eax,%eax
403b0a: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b11: e8 aa f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b16: e8 85 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b1b: c5 fb 11 04 24 vmovsd %xmm0,(%rsp)
403b20: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b25: 33 c0 xor %eax,%eax
403b27: e8 a4 f5 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403b2c: 85 c0 test %eax,%eax
403b2e: 74 25 je 403b55 <_Z12do_timed_runRKmRd+0x135>
403b30: ba 0b 3c 40 00 mov $0x403c0b,%edx
403b35: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b3a: be 03 00 00 00 mov $0x3,%esi
403b3f: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403b44: 33 c0 xor %eax,%eax
403b46: 4c 8d 41 38 lea 0x38(%rcx),%r8
403b4a: 4c 8d 49 70 lea 0x70(%rcx),%r9
403b4e: e8 4d f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403b53: eb 45 jmp 403b9a <_Z12do_timed_runRKmRd+0x17a>
403b55: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b5a: 33 c0 xor %eax,%eax
403b5c: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b63: e8 d8 f6 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403b68: be a0 13 47 00 mov $0x4713a0,%esi
403b6d: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403b74: 00
403b75: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403b7a: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403b7e: 4c 8d 42 70 lea 0x70(%rdx),%r8
403b82: e8 84 00 00 00 callq 403c0b <_Z12do_timed_runRKmRd+0x1eb>
403b87: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b8c: 33 c0 xor %eax,%eax
403b8e: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b95: e8 26 f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b9a: e8 01 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b9f: c5 fb 5c 0c 24 vsubsd (%rsp),%xmm0,%xmm1
403ba4: c5 fb 10 05 cc c4 01 vmovsd 0x1c4cc(%rip),%xmm0 # 420078 <alpha_beta.61562.0.0.28+0x28>
403bab: 00
403bac: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403bb1: c4 e2 f9 a9 0b vfmadd213sd (%rbx),%xmm0,%xmm1
403bb6: c5 fb 11 0b vmovsd %xmm1,(%rbx)
403bba: e8 71 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bbf: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403bc4: e8 67 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bc9: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403bce: e8 5d f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bd3: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403bda: 00
403bdb: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403be2: 00
403be3: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403bea: 00
403beb: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403bf2: 00
403bf3: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403bfa: 00
403bfb: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c02: 00
403c03: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403c0a: c3 retq
403c0b: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403c12: 4c 89 c6 mov %r8,%rsi
403c15: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403c1c: 00
403c1d: 4c 8d 24 24 lea (%rsp),%r12
403c21: 4c 89 e7 mov %r12,%rdi
403c24: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403c2b: 00
403c2c: 48 89 cd mov %rcx,%rbp
403c2f: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403c36: 00
403c37: 48 89 d3 mov %rdx,%rbx
403c3a: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403c41: 00
403c42: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403c49: 00
403c4a: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403c51: 00
403c52: e8 49 03 00 00 callq 403fa0 <_ZN5s3dft6matrixIdEC1ERKS1_> # <--- Here starts the part with the function call...
403c57: 4c 89 e7 mov %r12,%rdi
403c5a: 48 89 de mov %rbx,%rsi
403c5d: 48 89 ea mov %rbp,%rdx
403c60: e8 8b 01 00 00 callq 403df0 <_Z25transp_matrix_tensor_multIdEvRKN5s3dft6matrixIT_EERKNS0_6tensorIS2_EERS7_>
403c65: 4c 89 e7 mov %r12,%rdi
403c68: e8 63 01 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev> # <--- ...and here it ends
403c6d: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403c74: 00
403c75: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403c7c: 00
403c7d: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403c84: 00
403c85: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403c8c: 00
403c8d: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403c94: 00
403c95: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c9c: 00
403c9d: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403ca4: c3 retq
403ca5: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403cac: 48 89 d7 mov %rdx,%rdi
403caf: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403cb6: 00
403cb7: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403cbe: 00
403cbf: 48 89 cb mov %rcx,%rbx
403cc2: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403cc9: 00
403cca: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403cd1: 00
403cd2: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403cd9: 00
403cda: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403ce1: 00
403ce2: e8 99 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt> # <--- here are the calls to set-first-touch
403ce7: 48 89 df mov %rbx,%rdi
403cea: e8 91 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt>
403cef: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403cf6: 00
403cf7: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403cfe: 00
403cff: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403d06: 00
403d07: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403d0e: 00
403d0f: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403d16: 00
403d17: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403d1e: 00
403d1f: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403d26: c3 retq
403d27: 48 89 04 24 mov %rax,(%rsp)
403d2b: bf 30 f4 44 00 mov $0x44f430,%edi
403d30: e8 bb f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d35: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d3c: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403d41: e8 9a 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d46: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403d4b: e8 90 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d50: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403d55: e8 76 00 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev>
403d5a: 48 8b 3c 24 mov (%rsp),%rdi
403d5e: e8 5d f3 ff ff callq 4030c0 <_Unwind_Resume@plt>
403d63: 48 89 04 24 mov %rax,(%rsp)
403d67: bf 68 f4 44 00 mov $0x44f468,%edi
403d6c: e8 7f f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d71: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d78: eb cc jmp 403d46 <_Z12do_timed_runRKmRd+0x326>
403d7a: 48 89 04 24 mov %rax,(%rsp)
403d7e: bf a0 f4 44 00 mov $0x44f4a0,%edi
403d83: e8 68 f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d88: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d8f: eb bf jmp 403d50 <_Z12do_timed_runRKmRd+0x330>
403d91: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
403d98: 00
403d99: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
Main questions:
- Am I correct in assuming that the function is called outside of the timed region?
- If the above is true, why does this happen?
- If the above is not true, how can I find out why my benchmark is wrong?

Secondary questions:
- Why are there unconditional jumps in the code (at 403ad3, 403b53, 403d78 and 403d8f)?
- Why are there 3 retq instances in the same function, even though there is only one return path (at 403c0a, 403ca4 and 403d26)?

Please note that I have only provided the information I consider relevant; I am happy to provide more upon request. Thanks in advance for your time.
Edit:
@PeterCordes I did build with debug symbols enabled. The assembly posted above was obtained with objdump, which somehow did not retrieve the required symbols. Below is a snippet of the assembly obtained with icpc:
# omp_get_wtime()
call omp_get_wtime #122.23
..___tag_value__Z12do_timed_runRKmRd.267:
..LN419:
# LOE rbx xmm0
..B4.12: # Preds ..B4.11
# Execution count [1.00e+00]
..LN420:
vmovsd %xmm0, (%rsp) #122.23[spill]
..LN421:
# LOE rbx
..B4.13: # Preds ..B4.12
# Execution count [1.00e+00]
..LN422:
.loc 1 123 is_stmt 1
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN423:
xorl %eax, %eax #123.5
..___tag_value__Z12do_timed_runRKmRd.269:
..LN424:
call __kmpc_ok_to_fork #123.5
..___tag_value__Z12do_timed_runRKmRd.270:
..LN425:
# LOE rbx eax
..B4.14: # Preds ..B4.13
# Execution count [1.00e+00]
..LN426:
testl %eax, %eax #123.5
..LN427:
je ..B4.17 # Prob 50% #123.5
..LN428:
# LOE rbx
..B4.15: # Preds ..B4.14
# Execution count [0.00e+00]
..LN429:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN430:
xorl %edx, %edx #123.5
..LN431:
incq %rdx #123.5
..LN432:
xorl %eax, %eax #123.5
..LN433:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.271:
..LN434:
call __kmpc_push_num_threads #123.5
..___tag_value__Z12do_timed_runRKmRd.272:
..LN435:
# LOE rbx
..B4.16: # Preds ..B4.15
# Execution count [0.00e+00]
..LN436:
movl $L__Z12do_timed_runRKmRd_123__par_region1_2.5, %edx #123.5
..LN437:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN438:
movl $3, %esi #123.5
..LN439:
lea 8(%rsp), %rcx #123.5
..LN440:
xorl %eax, %eax #123.5
..LN441:
lea 56(%rcx), %r8 #123.5
..LN442:
lea 112(%rcx), %r9 #123.5
..___tag_value__Z12do_timed_runRKmRd.273:
..LN443:
call __kmpc_fork_call #123.5
..___tag_value__Z12do_timed_runRKmRd.274:
..LN444:
jmp ..B4.20 # Prob 100% #123.5
..LN445:
# LOE rbx
..B4.17: # Preds ..B4.14
# Execution count [0.00e+00]
..LN446:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN447:
xorl %eax, %eax #123.5
..LN448:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.275:
..LN449:
call __kmpc_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.276:
..LN450:
# LOE rbx
..B4.18: # Preds ..B4.17
# Execution count [0.00e+00]
..LN451:
movl $___kmpv_zero_Z12do_timed_runRKmRd_1, %esi #123.5
..LN452:
lea 208(%rsp), %rdi #123.5
..LN453:
lea 8(%rsp), %rdx #123.5
..LN454:
lea 56(%rdx), %rcx #123.5
..LN455:
lea 112(%rdx), %r8 #123.5
..___tag_value__Z12do_timed_runRKmRd.277:
..LN456:
call L__Z12do_timed_runRKmRd_123__par_region1_2.5 #123.5
..___tag_value__Z12do_timed_runRKmRd.278:
..LN457:
# LOE rbx
..B4.19: # Preds ..B4.18
# Execution count [0.00e+00]
..LN458:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN459:
xorl %eax, %eax #123.5
..LN460:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.279:
..LN461:
call __kmpc_end_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.280:
..LN462:
# LOE rbx
..B4.20: # Preds ..B4.16 ..B4.19
# Execution count [1.00e+00]
..___tag_value__Z12do_timed_runRKmRd.281:
..LN463:
.loc 1 128 is_stmt 1
# omp_get_wtime()
call omp_get_wtime #128.23
As you can see, the output is very verbose and hard to read.
1 Answer:
1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock, i.e. 24 * 16 = 384 GFLOP/s (using vectors of 4 doubles, i.e. 256-bit wide AVX). See "FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2".
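As a quick sanity check on that arithmetic, a minimal sketch (the constant names are mine; the per-core figures are the ones quoted above for a Skylake-derived core):

#include <cstdio>

int main()
{
    // Figures quoted above for an i5-8400 (Skylake-derived, 6 cores, 4.0 GHz turbo):
    const double cores = 6.0;
    const double clock_ghz = 4.0;          // 1e9 cycles per second per core
    const double fma_units = 2.0;          // 2 FMA pipes per core
    const double doubles_per_vector = 4.0; // 256-bit AVX holds 4 doubles
    const double flops_per_fma = 2.0;      // one FMA = one multiply + one add

    const double peak_gflops =
        cores * clock_ghz * fma_units * doubles_per_vector * flops_per_fma;
    std::printf("theoretical peak: %.0f GFLOP/s\n", peak_gflops); // prints 384
    return 0;
}

So a measurement of 127 GFLOP/s is well below the real hardware ceiling, even though it exceeds the 24 GFLOP/s estimate.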
There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).

There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just with more debug metadata.) Even a function invented by OpenMP should have a symbol name associated with it at some point.

As far as benchmark validity goes, a good litmus test is whether it scales reasonably with problem size: whether or not a smaller or larger problem pushes you past the L3 cache size, the time should change in some reasonable way. If it doesn't, you'd worry about the work being optimized away, or about clock-speed warm-up effects (see "Idiomatic way of performance evaluation?" for that and more, like page faults).
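A minimal sketch of that litmus test, assuming the question's do_timed_run(n, avg_dur) is available to link against (the size list and the formatting are mine):

#include <cstddef>
#include <cstdio>

// Declared here; defined in the question's code. avg_dur accumulates microseconds.
void do_timed_run(const std::size_t& n, double& avg_dur);

int main()
{
    // Time the same kernel at several sizes and check that the apparent
    // GFLOP/s stays in a plausible range instead of exploding (work optimized
    // away) or drifting wildly (warm-up / page-fault effects at small sizes).
    for (std::size_t n : {64, 96, 128, 192})
    {
        double dur_us = 0.0;
        do_timed_run(n, dur_us); // one run is enough for this sanity check
        const double flops = 6.0 * n * n * n * n; // the question's operation count
        std::printf("n=%4zu  t=%12.1f us  ~%7.2f GFLOP/s\n",
                    n, dur_us, flops / (dur_us * 1e-6) / 1e9);
    }
    return 0;
}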
Once you're already inside an if block, you unconditionally know that the else block should not run, so you jmp over it instead of using a jcc (even if FLAGS were still set so you didn't have to test the condition again). Alternatively, one block or the other is placed out-of-line (e.g. at the end of the function, or before the entry point) and reached with a jcc, and it then jmps back to just after the other side. That lets the fast path be contiguous, with no taken branches.

The duplicate ret instructions come from a "tail duplication" optimization, where multiple paths of execution that all return each get their own ret instead of jumping to a shared ret (along with copies of any cleanup necessary, like restoring registers and the stack pointer).
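To make the branch-layout and tail-duplication points concrete, here is a schematic toy example of my own (not from the question's code); the exact layout depends on the compiler, optimization level, and branch-probability estimates:

// Toy example: two out-of-line calls so the compiler can't if-convert.
int expensive_a(int);   // assume defined elsewhere
int expensive_b(int);

int process(int x)
{
    int y;
    if (x > 0)
        y = expensive_a(x);   // "then" block
    else
        y = expensive_b(x);   // "else" block
    return y + 1;             // common tail
}

// One common shape (register allocation and stack-alignment details omitted):
// the conditional branch (jcc) skips the "then" block, and an unconditional
// jmp skips over the "else" block afterwards.
//
//        test  edi, edi
//        jle   .Lelse
//        call  expensive_a
//        jmp   .Ldone        ; unconditional jump over the else block
//   .Lelse:
//        call  expensive_b
//   .Ldone:
//        add   eax, 1
//        ret
//
// With tail duplication, the common tail (add + ret) is instead copied into
// both paths, so each path ends in its own ret and the jmp disappears.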