Run time and reported cycle counts in Linux perf
I ran a single-threaded matrix multiplication on a 4-core Intel CPU (1 thread per core), but the numbers from perf don't make sense.
Performance counter stats for 'system wide':
31,728,397,287 cpu-cycles # 0.462 GHz
131,661,730,104 ref-cycles # 1916.425 M/sec
68,701.58 msec cpu-clock # 4.000 CPUs utilized
68,701.90 msec task-clock # 4.000 CPUs utilized
31,728,553,882 cpu/cpu-cycles/ # 461.830 M/sec
17.176244725 seconds time elapsed
I set the CPU frequency to the minimum and watched it; all cores were running at 800 MHz. That means 1 cycle takes 1.25 ns. With the total of 31,728,397,287 cpu-cycles, the execution time should be 39.66 seconds, but the run time is only 17.1 seconds.
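A quick check of that arithmetic with bc (assuming a constant 800 MHz clock the whole time):

$ echo '31728397287 / 800000000' | bc -l   # ~39.66 expected seconds at 800 MHz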
I also don't know why 0.462 GHz is printed next to cpu-cycles.
More information about the processor:
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping: 3
CPU MHz: 800.022
CPU max MHz: 3900.0000
CPU min MHz: 800.0000
Any thoughts about that?
UPDATE:
I reran the experiment with root access, this time also counting user-space-only cycles (cycles:u).
# perf stat -a -e cycles:u,cycles,cpu-clock ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 15.4839418610
MM execution Time in seconds: 15.3758427450
Performance counter stats for 'system wide':
14,237,521,876 cycles:u # 0.230 GHz
17,470,220,108 cycles # 0.282 GHz
61,974.41 msec cpu-clock # 4.000 CPUs utilized
15.494002570 seconds time elapsed
As you can see, the frequency is still not 800 MHz. However, if I don't specify -a, the result makes sense, because cycles:u * (1/800 MHz) is nearly the same as the elapsed time.
# perf stat -e cycles:u,cycles,cpu-clock ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 16.5347361100
MM execution Time in seconds: 16.4267430900
Performance counter stats for './mm_double_omp 1':
13,135,516,694 cycles:u # 0.794 GHz
13,201,778,987 cycles # 0.798 GHz
16,541.22 msec cpu-clock # 1.000 CPUs utilized
16.544487905 seconds time elapsed
16.522146000 seconds user
0.019997000 seconds sys
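Checking that claim with bc (assuming a constant 800 MHz clock):

$ echo '13135516694 / 800000000' | bc -l   # ~16.42 s, close to the 16.54 s elapsed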
1 Answer
But then you counted system-wide, across all 4 cores, for the whole time your program was running. That's what perf -a does, and why you got Performance counter stats for 'system wide': in your output.

So the kernel's task-clock "event" was counting time across all 4 cores whenever your process was active:

68,701.90 msec task-clock # 4.000 CPUs utilized

That's 68.7 CPU-seconds, almost exactly 4 x 17.17 s, which makes sense.
The cycle counts work the same way:

31,728,397,287 cpu-cycles # 0.462 GHz

That's the average cycles/time across all 4 cores during the time your code was active. The clock doesn't tick for CPUs that are asleep, so it seems the load average was 4 * 462/800 = 2.31 CPUs not asleep during the time your code was keeping 1 core busy. So you were benchmarking while your system was running another 1.3 threads on average (including CPU time spent waking up for interrupts).
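You can back both numbers out of the system-wide run with plain arithmetic (bc shown here; the inputs are the counts perf printed above):

$ echo '31728397287 / 68.70190 / 10^9' | bc -l   # average clock rate across 4 CPUs: ~0.462 GHz
$ echo '4 * 0.462 / 0.8' | bc -l                 # ~2.31 CPUs awake, if each runs at 800 MHz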
I'm assuming your measurements were not limited to user-space-only (excluding ticks spent in kernel mode); that's another possible effect.
If you hadn't used -a, a lower frequency than the HW was running at could still happen if you count only user-space cycles, not cycles spent in the kernel. (For example with perf stat --all-user, or with your kernel.perf_event_paranoid setting high enough to not let you profile HW events in the kernel, which effectively applies :u to HW events but without printing them as cycles:u.)
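You can check which regime you're in before rerunning (sysctl is the standard interface; the level semantics in the comment are for mainline kernels):

$ sysctl kernel.perf_event_paranoid   # 2 or higher limits non-root to user-space counts; 0 or lower also allows -a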
If you change your kernel.perf_event_paranoid sysctl setting to 0, or run perf as root, you should see exactly 800 MHz as cycles / task-clock, since that's the speed the CPU was running at while your threads were running.

Spending a lot of cycles outside user-space would indicate that your code spends a lot of time in the kernel, probably handling page faults if you're using a big array.
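If page faults are the suspect, it's cheap to count them directly next time (a sketch; page-faults is one of perf's software events):

# perf stat -e cycles:u,cycles,page-faults ./mm_double_omp 1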