Linux perf 中的运行时间和报告的周期计数

发布于 2025-01-12 05:15:00 字数 2602 浏览 1 评论 0原文

我在 4 核 Intel CPU(每个核心 1 个线程)上运行了单线程矩阵乘法,但 perf 中的数字没有意义。

 Performance counter stats for 'system wide':

    31,728,397,287      cpu-cycles                #    0.462 GHz                    
   131,661,730,104      ref-cycles                # 1916.425 M/sec                  
         68,701.58 msec cpu-clock                 #    4.000 CPUs utilized          
         68,701.90 msec task-clock                #    4.000 CPUs utilized          
    31,728,553,882      cpu/cpu-cycles/           #  461.830 M/sec                  

      17.176244725 seconds time elapsed

我已将 cpu 频率设置为最低并观察,所有核心都以 800MHz 运行。这意味着 1 个周期为 1.25ns。总 CPU 周期为 31,728,397,287,执行时间应为 39.66 秒,但运行时间为 17.1 秒。

我也不知道为什么 0.462 GHz 写在 cpu-cycles 前面。

有关处理器的更多信息:

Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping:                        3
CPU MHz:                         800.022
CPU max MHz:                     3900,0000
CPU min MHz:                     800,0000

对此有何想法?

更新:

我使用 root 访问权限重新运行实验并指定用户代码。

# perf stat -a -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 15.4839418610
MM execution Time in seconds: 15.3758427450

 Performance counter stats for 'system wide':

    14,237,521,876      cycles:u                  #    0.230 GHz                    
    17,470,220,108      cycles                    #    0.282 GHz                    
         61,974.41 msec cpu-clock                 #    4.000 CPUs utilized          

      15.494002570 seconds time elapsed

正如你所看到的,频率仍然不是800MHz。但是,如果我不指定 -a ,结果是有意义的,因为 Cycles:u * (1/800MHz) 几乎与经过的时间相同。

# perf stat -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 16.5347361100
MM execution Time in seconds: 16.4267430900
 Performance counter stats for './mm_double_omp 1':

    13.135.516.694      cycles:u                  #    0,794 GHz                    
    13.201.778.987      cycles                    #    0,798 GHz                    
         16.541,22 msec cpu-clock                 #    1,000 CPUs utilized          

      16,544487905 seconds time elapsed

      16,522146000 seconds user
       0,019997000 seconds sys

I have run a single thread matrix multiplication on a 4-core Intel CPU (1 thread per core), but the numbers from perf doesn't make sense.

 Performance counter stats for 'system wide':

    31,728,397,287      cpu-cycles                #    0.462 GHz                    
   131,661,730,104      ref-cycles                # 1916.425 M/sec                  
         68,701.58 msec cpu-clock                 #    4.000 CPUs utilized          
         68,701.90 msec task-clock                #    4.000 CPUs utilized          
    31,728,553,882      cpu/cpu-cycles/           #  461.830 M/sec                  

      17.176244725 seconds time elapsed

I have set the cpu frequency to minimum and watched it so, all cores were running at 800MHz. That means 1 cycle is 1.25ns. With the total cpu cycles 31,728,397,287 the execution time should be 39.66 seconds, but the run time is 17.1 seconds.

I also don't know why 0.462 GHz is written in front of cpu-cycles.

More information about the processor:

Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping:                        3
CPU MHz:                         800.022
CPU max MHz:                     3900,0000
CPU min MHz:                     800,0000

Any thoughts about that?

UPDATE:

I rerun the experiment with root access and specifying the user code.

# perf stat -a -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 15.4839418610
MM execution Time in seconds: 15.3758427450

 Performance counter stats for 'system wide':

    14,237,521,876      cycles:u                  #    0.230 GHz                    
    17,470,220,108      cycles                    #    0.282 GHz                    
         61,974.41 msec cpu-clock                 #    4.000 CPUs utilized          

      15.494002570 seconds time elapsed

As you can see the frequency is still not 800MHz. However, if I don't specify -a the result makes sense because cycles:u * (1/800MHz) is nearly nearly the same as elapsed time.

# perf stat -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 16.5347361100
MM execution Time in seconds: 16.4267430900
 Performance counter stats for './mm_double_omp 1':

    13.135.516.694      cycles:u                  #    0,794 GHz                    
    13.201.778.987      cycles                    #    0,798 GHz                    
         16.541,22 msec cpu-clock                 #    1,000 CPUs utilized          

      16,544487905 seconds time elapsed

      16,522146000 seconds user
       0,019997000 seconds sys

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

素衣风尘叹 2025-01-19 05:15:00

我运行了单线程矩阵乘法

但是您在运行时对所有 4 个核心的系统范围进行了计数。这就是什么perf -a 确实如此,以及为什么您在输出中获得了Performance counter stats for 'system Wide':

因此,只要进程处于活动状态,内核的任务时钟“事件”就会计算所有 4 个内核的时间:

68,701.90 毫秒任务时钟 # 使用 4.000 个 CPU

这是 68.7 个 CPU 秒,几乎正好是 4x 17.17 秒,这是有道理的。


我也不知道为什么 0.462 GHz 写在 cpu-cycles 前面。

这是代码处于活动状态期间所有 4 个核心的平均周期/时间。对于处于睡眠状态的 CPU,时钟不会滴答作响,因此在您的代码保持 1 个核心忙碌期间,平均负载似乎为 4 * 462/800 = 2.31 处于睡眠状态的 CPU。

因此,您在系统平均运行另外 1.3 个线程时进行基准测试。 (包括唤醒中断的 CPU 时间)

我假设您的测量不限于仅用户空间(不包括在内核模式下花费的时间);这是另一种可能的影响。


如果您没有使用 -a,如果您只计算用户空间周期而不是内核中花费的周期,仍然可能会出现低于硬件运行频率的情况。 (例如 perf stat --all-user,或者将 kernel.perf_event_paranoid 设置得足够高,以防止您分析有效应用 的内核中的硬件事件: u 到硬件事件,但不将它们打印为 cycles:u。)

如果将 kernel.perf_event_paranoid sysctl 设置更改为 0,或运行作为根用户,您应该看到 800 MHz 为周期/任务时钟,因为这是线程运行时 CPU 运行的速度。

在用户空间之外花费大量周期表明您的代码在内核中花费了大量时间,如果您使用大数组,则可能会处理页面错误。

I have run a single thread matrix multiplication

But then you counted system-wide across all 4 cores for the time when it was running. That's what perf -a does, and why you got Performance counter stats for 'system wide': in your ouput.

So the kernel's task-clock "event" was counting time across all 4 cores whenever your process was active:

68,701.90 msec task-clock # 4.000 CPUs utilized

That's 68.7 CPU-seconds, almost exactly 4x 17.17 s, which makes sense.


I also don't know why 0.462 GHz is written in front of cpu-cycles.

That's the average cycles/time across all 4 cores during the time your code was active. The clock doesn't tick for CPUs that are asleep, so it seems the load average was 4 * 462/800 = 2.31 CPUs not asleep during the time your code was keeping 1 core busy.

So you were benchmarking while your system was running another 1.3 threads on average. (Including CPU time waking up for interrupts)

I'm assuming your measurements were not limited to user-space-only (excluding ticks spent in kernel mode); that's another possible effect.


If you hadn't used -a, a lower frequency than the HW was running at can still happen if you count only user-space cycles, not cycles spent in the kernel. (e.g. perf stat --all-user, or with your kernel.perf_event_paranoid setting high enough to not let you profile HW events in the kernel which effectively applies :u to HW events but without printing them as cycles:u.)

If you change your kernel.perf_event_paranoid sysctl setting to 0, or run perf as root, you should see exactly 800 MHz as cycles / task-clock, since that's the speed the CPU was running when your threads were running.

Spending a lot of cycles not in user-space would indicate that your code is spending a lot of time in the kernel, probably handling page faults if you're using a big array.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文