Run time and reported cycle counts in Linux perf
I ran a single-threaded matrix multiplication on a 4-core Intel CPU (1 thread per core), but the numbers from perf don't make sense.
Performance counter stats for 'system wide':
31,728,397,287 cpu-cycles # 0.462 GHz
131,661,730,104 ref-cycles # 1916.425 M/sec
68,701.58 msec cpu-clock # 4.000 CPUs utilized
68,701.90 msec task-clock # 4.000 CPUs utilized
31,728,553,882 cpu/cpu-cycles/ # 461.830 M/sec
17.176244725 seconds time elapsed
I set the CPU frequency to the minimum and watched it; all cores were running at 800 MHz. That means 1 cycle takes 1.25 ns. With the total of 31,728,397,287 cpu-cycles, the execution time should be 39.66 seconds, but the run time is only 17.1 seconds.
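A quick check of that arithmetic with bc (assuming a constant 800 MHz clock the whole time):

$ echo '31728397287 / 800000000' | bc -l   # ~39.66 expected seconds at 800 MHz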
I also don't know why 0.462 GHz is printed next to cpu-cycles.
More information about the processor:
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping: 3
CPU MHz: 800.022
CPU max MHz: 3900.0000
CPU min MHz: 800.0000
Any thoughts about that?
UPDATE:
I reran the experiment with root access, this time also counting user-space-only cycles (cycles:u).
# perf stat -a -e cycles:u,cycles,cpu-clock ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 15.4839418610
MM execution Time in seconds: 15.3758427450
Performance counter stats for 'system wide':
14,237,521,876 cycles:u # 0.230 GHz
17,470,220,108 cycles # 0.282 GHz
61,974.41 msec cpu-clock # 4.000 CPUs utilized
15.494002570 seconds time elapsed
As you can see, the frequency is still not 800 MHz. However, if I don't specify -a, the result makes sense, because cycles:u * (1/800 MHz) is nearly the same as the elapsed time.
# perf stat -e cycles:u,cycles,cpu-clock ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 16.5347361100
MM execution Time in seconds: 16.4267430900
Performance counter stats for './mm_double_omp 1':
13,135,516,694 cycles:u # 0.794 GHz
13,201,778,987 cycles # 0.798 GHz
16,541.22 msec cpu-clock # 1.000 CPUs utilized
16.544487905 seconds time elapsed
16.522146000 seconds user
0.019997000 seconds sys
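Checking that claim with bc (assuming a constant 800 MHz clock):

$ echo '13135516694 / 800000000' | bc -l   # ~16.42 s, close to the 16.54 s elapsed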
1 Answer
But then you counted system-wide, across all 4 cores, for the whole time your program was running. That's what perf -a does, and why you got Performance counter stats for 'system wide': in your output.

So the kernel's task-clock "event" was counting time across all 4 cores whenever your process was active:

68,701.90 msec task-clock # 4.000 CPUs utilized

That's 68.7 CPU-seconds, almost exactly 4 x 17.17 s, which makes sense.
The cycle counts work the same way:

31,728,397,287 cpu-cycles # 0.462 GHz

That's the average cycles/time across all 4 cores during the time your code was active. The clock doesn't tick for CPUs that are asleep, so it seems the load average was 4 * 462/800 = 2.31 CPUs not asleep during the time your code was keeping 1 core busy. So you were benchmarking while your system was running another 1.3 threads on average (including CPU time spent waking up for interrupts).
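You can back both numbers out of the system-wide run with plain arithmetic (bc shown here; the inputs are the counts perf printed above):

$ echo '31728397287 / 68.70190 / 10^9' | bc -l   # average clock rate across 4 CPUs: ~0.462 GHz
$ echo '4 * 0.462 / 0.8' | bc -l                 # ~2.31 CPUs awake, if each runs at 800 MHz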
I'm assuming your measurements were not limited to user-space-only (excluding ticks spent in kernel mode); that's another possible effect.
If you hadn't used -a, a lower frequency than the HW was running at could still happen if you count only user-space cycles, not cycles spent in the kernel. (For example with perf stat --all-user, or with your kernel.perf_event_paranoid setting high enough to not let you profile HW events in the kernel, which effectively applies :u to HW events but without printing them as cycles:u.)
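You can check which regime you're in before rerunning (sysctl is the standard interface; the level semantics in the comment are for mainline kernels):

$ sysctl kernel.perf_event_paranoid   # 2 or higher limits non-root to user-space counts; 0 or lower also allows -a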
If you change your kernel.perf_event_paranoid sysctl setting to 0, or run perf as root, you should see exactly 800 MHz as cycles / task-clock, since that's the speed the CPU was running at while your threads were running.

Spending a lot of cycles outside user-space would indicate that your code spends a lot of time in the kernel, probably handling page faults if you're using a big array.
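If page faults are the suspect, it's cheap to count them directly next time (a sketch; page-faults is one of perf's software events):

# perf stat -e cycles:u,cycles,page-faults ./mm_double_omp 1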