Benchmark erroneous, assembly puzzling

Assembly newbie here. I wrote a benchmark to measure the floating-point performance of my machine while computing a transposed matrix-tensor product.

Given that my machine has 32 GiB of RAM (bandwidth ~37 GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (turbo 4.0 GHz), I estimated the maximum performance (with pipelining and with data held in registers) to be 6 cores x 4.0 GHz = 24 GFLOP/s. However, when I run the benchmark, I measure 127 GFLOP/s, which is clearly an erroneous measurement.

Note: to measure FP performance, I count the number of operations as n*n*n*n*6 (n^3 for the matrix-matrix multiplication, performed on n slices of complex data points, i.e. assuming one complex-complex multiplication takes 6 FLOPs) and divide it by the average duration of a run.
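For reference, here is a minimal sketch of that calculation (the helper measured_gflops is hypothetical, added for illustration only; it assumes avg_dur ends up in microseconds, as produced by do_timed_run below):

#include <cstddef>

// Hypothetical helper (not part of the original code): converts the operation
// count n*n*n*n*6 and an average duration in microseconds into GFLOP/s.
double measured_gflops(std::size_t n, double avg_dur_us)
{
    const double flop_count = 6.0 * n * n * n * n; // n^4 complex mults, 6 FLOPs each
    const double seconds = avg_dur_us * 1e-6;
    return flop_count / seconds / 1e9;
}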
Code snippet from the main function:
// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);
Code snippet of do_timed_run:
void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);

    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }

    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;

    avg_dur += dur2 - dur1;
}
Notes:
- At this point I am not providing the code of the function transp_matrix_tensor_mult, because I believe it is not relevant.
- #pragma noinline is a debugging device I use to better understand the disassembler's output.

Now the disassembly of the function do_timed_run:
0000000000403a20 <_Z12do_timed_runRKmRd>:
403a20: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403a27: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403a2e: 00
403a2f: 48 89 fd mov %rdi,%rbp
403a32: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403a39: 00
403a3a: 48 89 f3 mov %rsi,%rbx
403a3d: 48 89 ee mov %rbp,%rsi
403a40: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403a45: 48 89 ea mov %rbp,%rdx
403a48: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403a4f: 00
403a50: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403a57: 00
403a58: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403a5f: 00
403a60: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403a67: 00
403a68: e8 03 f8 ff ff callq 403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
403a6d: 48 89 ee mov %rbp,%rsi
403a70: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403a75: 48 89 ea mov %rbp,%rdx
403a78: 48 89 e9 mov %rbp,%rcx
403a7b: e8 80 f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a80: 48 89 ee mov %rbp,%rsi
403a83: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403a88: 48 89 ea mov %rbp,%rdx
403a8b: 48 89 e9 mov %rbp,%rcx
403a8e: e8 6d f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a93: bf 88 f3 44 00 mov $0x44f388,%edi
403a98: e8 53 f7 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403a9d: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403aa4: bf c0 f3 44 00 mov $0x44f3c0,%edi
403aa9: 33 c0 xor %eax,%eax
403aab: e8 20 f6 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403ab0: 85 c0 test %eax,%eax
403ab2: 74 21 je 403ad5 <_Z12do_timed_runRKmRd+0xb5>
403ab4: ba a5 3c 40 00 mov $0x403ca5,%edx
403ab9: bf c0 f3 44 00 mov $0x44f3c0,%edi
403abe: be 02 00 00 00 mov $0x2,%esi
403ac3: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403ac8: 33 c0 xor %eax,%eax
403aca: 4c 8d 41 38 lea 0x38(%rcx),%r8
403ace: e8 cd f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403ad3: eb 41 jmp 403b16 <_Z12do_timed_runRKmRd+0xf6>
403ad5: bf c0 f3 44 00 mov $0x44f3c0,%edi
403ada: 33 c0 xor %eax,%eax
403adc: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403ae3: e8 58 f7 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403ae8: be 9c 13 47 00 mov $0x47139c,%esi
403aed: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403af4: 00
403af5: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403afa: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403afe: e8 a2 01 00 00 callq 403ca5 <_Z12do_timed_runRKmRd+0x285>
403b03: bf c0 f3 44 00 mov $0x44f3c0,%edi
403b08: 33 c0 xor %eax,%eax
403b0a: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b11: e8 aa f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b16: e8 85 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b1b: c5 fb 11 04 24 vmovsd %xmm0,(%rsp)
403b20: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b25: 33 c0 xor %eax,%eax
403b27: e8 a4 f5 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403b2c: 85 c0 test %eax,%eax
403b2e: 74 25 je 403b55 <_Z12do_timed_runRKmRd+0x135>
403b30: ba 0b 3c 40 00 mov $0x403c0b,%edx
403b35: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b3a: be 03 00 00 00 mov $0x3,%esi
403b3f: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403b44: 33 c0 xor %eax,%eax
403b46: 4c 8d 41 38 lea 0x38(%rcx),%r8
403b4a: 4c 8d 49 70 lea 0x70(%rcx),%r9
403b4e: e8 4d f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403b53: eb 45 jmp 403b9a <_Z12do_timed_runRKmRd+0x17a>
403b55: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b5a: 33 c0 xor %eax,%eax
403b5c: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b63: e8 d8 f6 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403b68: be a0 13 47 00 mov $0x4713a0,%esi
403b6d: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403b74: 00
403b75: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403b7a: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403b7e: 4c 8d 42 70 lea 0x70(%rdx),%r8
403b82: e8 84 00 00 00 callq 403c0b <_Z12do_timed_runRKmRd+0x1eb>
403b87: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b8c: 33 c0 xor %eax,%eax
403b8e: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b95: e8 26 f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b9a: e8 01 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b9f: c5 fb 5c 0c 24 vsubsd (%rsp),%xmm0,%xmm1
403ba4: c5 fb 10 05 cc c4 01 vmovsd 0x1c4cc(%rip),%xmm0 # 420078 <alpha_beta.61562.0.0.28+0x28>
403bab: 00
403bac: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403bb1: c4 e2 f9 a9 0b vfmadd213sd (%rbx),%xmm0,%xmm1
403bb6: c5 fb 11 0b vmovsd %xmm1,(%rbx)
403bba: e8 71 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bbf: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403bc4: e8 67 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bc9: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403bce: e8 5d f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bd3: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403bda: 00
403bdb: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403be2: 00
403be3: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403bea: 00
403beb: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403bf2: 00
403bf3: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403bfa: 00
403bfb: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c02: 00
403c03: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403c0a: c3 retq
403c0b: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403c12: 4c 89 c6 mov %r8,%rsi
403c15: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403c1c: 00
403c1d: 4c 8d 24 24 lea (%rsp),%r12
403c21: 4c 89 e7 mov %r12,%rdi
403c24: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403c2b: 00
403c2c: 48 89 cd mov %rcx,%rbp
403c2f: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403c36: 00
403c37: 48 89 d3 mov %rdx,%rbx
403c3a: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403c41: 00
403c42: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403c49: 00
403c4a: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403c51: 00
403c52: e8 49 03 00 00 callq 403fa0 <_ZN5s3dft6matrixIdEC1ERKS1_> # <--- Here starts the part with the function call...
403c57: 4c 89 e7 mov %r12,%rdi
403c5a: 48 89 de mov %rbx,%rsi
403c5d: 48 89 ea mov %rbp,%rdx
403c60: e8 8b 01 00 00 callq 403df0 <_Z25transp_matrix_tensor_multIdEvRKN5s3dft6matrixIT_EERKNS0_6tensorIS2_EERS7_>
403c65: 4c 89 e7 mov %r12,%rdi
403c68: e8 63 01 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev> # <--- ...and here it ends
403c6d: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403c74: 00
403c75: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403c7c: 00
403c7d: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403c84: 00
403c85: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403c8c: 00
403c8d: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403c94: 00
403c95: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c9c: 00
403c9d: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403ca4: c3 retq
403ca5: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403cac: 48 89 d7 mov %rdx,%rdi
403caf: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403cb6: 00
403cb7: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403cbe: 00
403cbf: 48 89 cb mov %rcx,%rbx
403cc2: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403cc9: 00
403cca: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403cd1: 00
403cd2: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403cd9: 00
403cda: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403ce1: 00
403ce2: e8 99 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt> # <--- here are the calls to set-first-touch
403ce7: 48 89 df mov %rbx,%rdi
403cea: e8 91 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt>
403cef: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403cf6: 00
403cf7: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403cfe: 00
403cff: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403d06: 00
403d07: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403d0e: 00
403d0f: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403d16: 00
403d17: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403d1e: 00
403d1f: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403d26: c3 retq
403d27: 48 89 04 24 mov %rax,(%rsp)
403d2b: bf 30 f4 44 00 mov $0x44f430,%edi
403d30: e8 bb f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d35: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d3c: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403d41: e8 9a 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d46: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403d4b: e8 90 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d50: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403d55: e8 76 00 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev>
403d5a: 48 8b 3c 24 mov (%rsp),%rdi
403d5e: e8 5d f3 ff ff callq 4030c0 <_Unwind_Resume@plt>
403d63: 48 89 04 24 mov %rax,(%rsp)
403d67: bf 68 f4 44 00 mov $0x44f468,%edi
403d6c: e8 7f f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d71: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d78: eb cc jmp 403d46 <_Z12do_timed_runRKmRd+0x326>
403d7a: 48 89 04 24 mov %rax,(%rsp)
403d7e: bf a0 f4 44 00 mov $0x44f4a0,%edi
403d83: e8 68 f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d88: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d8f: eb bf jmp 403d50 <_Z12do_timed_runRKmRd+0x330>
403d91: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
403d98: 00
403d99: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
Main questions:
- Am I correct in assuming that the function is called outside of the timed region?
- If the above is true, why does this happen?
- If the above is not true, how can I find out why my benchmark is wrong?

Secondary questions:
- Why are there unconditional jumps in the code (at 403ad3, 403b53, 403d78 and 403d8f)?
- Why are there 3 retq instances in the same function, even though there is only one return path (at 403c0a, 403ca4 and 403d26)?

Please note that I have only provided the information I consider relevant; I am happy to provide more upon request. Thanks in advance for your time.
Edit:
@PeterCordes I did build with debug symbols enabled. The assembly posted above was obtained with objdump, which somehow did not retrieve the required symbols. Below is a snippet of the assembly obtained with icpc:
# omp_get_wtime()
call omp_get_wtime #122.23
..___tag_value__Z12do_timed_runRKmRd.267:
..LN419:
# LOE rbx xmm0
..B4.12: # Preds ..B4.11
# Execution count [1.00e+00]
..LN420:
vmovsd %xmm0, (%rsp) #122.23[spill]
..LN421:
# LOE rbx
..B4.13: # Preds ..B4.12
# Execution count [1.00e+00]
..LN422:
.loc 1 123 is_stmt 1
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN423:
xorl %eax, %eax #123.5
..___tag_value__Z12do_timed_runRKmRd.269:
..LN424:
call __kmpc_ok_to_fork #123.5
..___tag_value__Z12do_timed_runRKmRd.270:
..LN425:
# LOE rbx eax
..B4.14: # Preds ..B4.13
# Execution count [1.00e+00]
..LN426:
testl %eax, %eax #123.5
..LN427:
je ..B4.17 # Prob 50% #123.5
..LN428:
# LOE rbx
..B4.15: # Preds ..B4.14
# Execution count [0.00e+00]
..LN429:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN430:
xorl %edx, %edx #123.5
..LN431:
incq %rdx #123.5
..LN432:
xorl %eax, %eax #123.5
..LN433:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.271:
..LN434:
call __kmpc_push_num_threads #123.5
..___tag_value__Z12do_timed_runRKmRd.272:
..LN435:
# LOE rbx
..B4.16: # Preds ..B4.15
# Execution count [0.00e+00]
..LN436:
movl $L__Z12do_timed_runRKmRd_123__par_region1_2.5, %edx #123.5
..LN437:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN438:
movl $3, %esi #123.5
..LN439:
lea 8(%rsp), %rcx #123.5
..LN440:
xorl %eax, %eax #123.5
..LN441:
lea 56(%rcx), %r8 #123.5
..LN442:
lea 112(%rcx), %r9 #123.5
..___tag_value__Z12do_timed_runRKmRd.273:
..LN443:
call __kmpc_fork_call #123.5
..___tag_value__Z12do_timed_runRKmRd.274:
..LN444:
jmp ..B4.20 # Prob 100% #123.5
..LN445:
# LOE rbx
..B4.17: # Preds ..B4.14
# Execution count [0.00e+00]
..LN446:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN447:
xorl %eax, %eax #123.5
..LN448:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.275:
..LN449:
call __kmpc_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.276:
..LN450:
# LOE rbx
..B4.18: # Preds ..B4.17
# Execution count [0.00e+00]
..LN451:
movl $___kmpv_zero_Z12do_timed_runRKmRd_1, %esi #123.5
..LN452:
lea 208(%rsp), %rdi #123.5
..LN453:
lea 8(%rsp), %rdx #123.5
..LN454:
lea 56(%rdx), %rcx #123.5
..LN455:
lea 112(%rdx), %r8 #123.5
..___tag_value__Z12do_timed_runRKmRd.277:
..LN456:
call L__Z12do_timed_runRKmRd_123__par_region1_2.5 #123.5
..___tag_value__Z12do_timed_runRKmRd.278:
..LN457:
# LOE rbx
..B4.19: # Preds ..B4.18
# Execution count [0.00e+00]
..LN458:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN459:
xorl %eax, %eax #123.5
..LN460:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.279:
..LN461:
call __kmpc_end_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.280:
..LN462:
# LOE rbx
..B4.20: # Preds ..B4.16 ..B4.19
# Execution count [1.00e+00]
..___tag_value__Z12do_timed_runRKmRd.281:
..LN463:
.loc 1 128 is_stmt 1
# omp_get_wtime()
call omp_get_wtime #128.23
As you can see, the output is very verbose and hard to read.
1 Answer:
1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock, i.e. 24 * 16 = 384 GFLOP/s (using vectors of 4 doubles, i.e. 256-bit wide AVX). See "FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2".
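As a quick sanity check on that arithmetic, a minimal sketch (the constant names are mine; the per-core figures are the ones quoted above for a Skylake-derived core):

#include <cstdio>

int main()
{
    // Figures quoted above for an i5-8400 (Skylake-derived, 6 cores, 4.0 GHz turbo):
    const double cores = 6.0;
    const double clock_ghz = 4.0;          // 1e9 cycles per second per core
    const double fma_units = 2.0;          // 2 FMA pipes per core
    const double doubles_per_vector = 4.0; // 256-bit AVX holds 4 doubles
    const double flops_per_fma = 2.0;      // one FMA = one multiply + one add

    const double peak_gflops =
        cores * clock_ghz * fma_units * doubles_per_vector * flops_per_fma;
    std::printf("theoretical peak: %.0f GFLOP/s\n", peak_gflops); // prints 384
    return 0;
}

So a measurement of 127 GFLOP/s is well below the real hardware ceiling, even though it exceeds the 24 GFLOP/s estimate.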
There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).

There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just with more debug metadata.) Even a function invented by OpenMP should have a symbol name associated with it at some point.

As far as benchmark validity goes, a good litmus test is whether it scales reasonably with problem size: whether or not a smaller or larger problem pushes you past the L3 cache size, the time should change in some reasonable way. If it doesn't, you'd worry about the work being optimized away, or about clock-speed warm-up effects (see "Idiomatic way of performance evaluation?" for that and more, like page faults).
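A minimal sketch of that litmus test, assuming the question's do_timed_run(n, avg_dur) is available to link against (the size list and the formatting are mine):

#include <cstddef>
#include <cstdio>

// Declared here; defined in the question's code. avg_dur accumulates microseconds.
void do_timed_run(const std::size_t& n, double& avg_dur);

int main()
{
    // Time the same kernel at several sizes and check that the apparent
    // GFLOP/s stays in a plausible range instead of exploding (work optimized
    // away) or drifting wildly (warm-up / page-fault effects at small sizes).
    for (std::size_t n : {64, 96, 128, 192})
    {
        double dur_us = 0.0;
        do_timed_run(n, dur_us); // one run is enough for this sanity check
        const double flops = 6.0 * n * n * n * n; // the question's operation count
        std::printf("n=%4zu  t=%12.1f us  ~%7.2f GFLOP/s\n",
                    n, dur_us, flops / (dur_us * 1e-6) / 1e9);
    }
    return 0;
}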
Once you're already inside an if block, you unconditionally know that the else block should not run, so you jmp over it instead of using a jcc (even if FLAGS were still set so you didn't have to test the condition again). Alternatively, one block or the other is placed out-of-line (e.g. at the end of the function, or before the entry point) and reached with a jcc, and it then jmps back to just after the other side. That lets the fast path be contiguous, with no taken branches.

The duplicate ret instructions come from a "tail duplication" optimization, where multiple paths of execution that all return each get their own ret instead of jumping to a shared ret (along with copies of any cleanup necessary, like restoring registers and the stack pointer).
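To make the branch-layout and tail-duplication points concrete, here is a schematic toy example of my own (not from the question's code); the exact layout depends on the compiler, optimization level, and branch-probability estimates:

// Toy example: two out-of-line calls so the compiler can't if-convert.
int expensive_a(int);   // assume defined elsewhere
int expensive_b(int);

int process(int x)
{
    int y;
    if (x > 0)
        y = expensive_a(x);   // "then" block
    else
        y = expensive_b(x);   // "else" block
    return y + 1;             // common tail
}

// One common shape (register allocation and stack-alignment details omitted):
// the conditional branch (jcc) skips the "then" block, and an unconditional
// jmp skips over the "else" block afterwards.
//
//        test  edi, edi
//        jle   .Lelse
//        call  expensive_a
//        jmp   .Ldone        ; unconditional jump over the else block
//   .Lelse:
//        call  expensive_b
//   .Ldone:
//        add   eax, 1
//        ret
//
// With tail duplication, the common tail (add + ret) is instead copied into
// both paths, so each path ends in its own ret and the jmp disappears.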