Effect of min_granularity_ns on performance

Posted on 2025-01-20 05:30:58

To find the effect of the kernel parameter min_granularity_ns, I ran a 16-thread OpenMP implementation of matrix multiplication with a low and a high value of that parameter. The perf results are shown below:

# echo 1000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_omp 16
Using 16 threads
Total execution Time in seconds: 12.3690895601
MM execution Time in seconds: 12.2312941169

 Performance counter stats for 'system wide':

            911.97 Joules power/energy-pkg/
   218,012,129,383        instructions              #    0.26  insn per cycle
   823,773,717,094        cycles
            37,701        context-switches
               131        cpu-migrations
            51,012        page-faults

      12.369310043 seconds time elapsed

# echo 1000000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_double_omp 16
Using 16 threads
Total execution Time in seconds: 12.3981724780
MM execution Time in seconds: 12.2612874920

 Performance counter stats for 'system wide':

            881.48 Joules power/energy-pkg/
   218,063,319,724        instructions              #    0.27  insn per cycle
   822,622,830,036        cycles
            37,959        context-switches
               146        cpu-migrations
            51,553        page-faults

      12.400958939 seconds time elapsed

As you can see, there is no difference between the results despite the large change in the kernel parameter, from 1 ms (1,000,000 ns) to 1 s (1,000,000,000 ns). Although there are other scheduler parameters besides min_granularity_ns, does this lack of difference make sense? Or is this perhaps not the right program to test with?
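To put numbers on that comparison, here is a small sketch that converts the two settings to milliseconds and computes the relative change in elapsed time from the perf output above. The helper names `ns_to_ms` and `rel_diff_pct` are made up for illustration; the input values are copied from the two runs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of perf or the kernel tooling).

# Convert a nanosecond value to milliseconds.
ns_to_ms() {
    awk -v ns="$1" 'BEGIN { printf "%g", ns / 1e6 }'
}

# Relative change from a to b, in percent, with two decimals.
rel_diff_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", (b - a) / a * 100 }'
}

# The two min_granularity_ns settings used in the runs above.
echo "low setting:  $(ns_to_ms 1000000) ms"
echo "high setting: $(ns_to_ms 1000000000) ms"

# "seconds time elapsed" from the first and second perf stat run.
echo "elapsed time change: $(rel_diff_pct 12.369310043 12.400958939) %"
```

The change in elapsed time comes out well below one percent, which is probably within run-to-run noise for a single measurement.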

UPDATE 1: I tested another implementation, which uses CBLAS with 16 threads. As you can see, for a large matrix size (20k), the IPC is 1.77, which is acceptable. Again, varying min_granularity_ns makes no difference in execution time, although the number of context switches decreases with the larger granularity.

# echo 1000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_blas
Total execution Time in seconds: 47.0226452020
MM execution Time in seconds: 37.1756865050

 Performance counter stats for 'system wide':

          3,106.80 Joules power/energy-pkg/
 3,943,151,227,404        instructions              #    1.77  insn per cycle
 2,230,425,316,645        cycles
           273,271        context-switches
               383        cpu-migrations
         2,360,017        page-faults

      47.272118708 seconds time elapsed

# echo 1000000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_blas
Total execution Time in seconds: 46.8930790700
MM execution Time in seconds: 37.0639640210

 Performance counter stats for 'system wide':

          3,080.33 Joules power/energy-pkg/
 3,924,979,103,204        instructions              #    1.77  insn per cycle
 2,223,571,579,672        cycles
           125,643        context-switches
               355        cpu-migrations
         2,358,432        page-faults

      47.148148344 seconds time elapsed
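
The size of the context-switch drop in the two CBLAS runs can be worked out from the counters above. This is only a sketch with awk; `pct_drop` and `per_sec` are hypothetical helper names, and the numbers are copied from the perf output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing the two mm_blas runs.

# Percentage decrease going from a to b, with one decimal.
pct_drop() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", (a - b) / a * 100 }'
}

# Events per second of elapsed time, rounded to an integer.
per_sec() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f", n / t }'
}

# context-switches: 273,271 (1 ms setting) vs 125,643 (1 s setting)
echo "context-switch drop: $(pct_drop 273271 125643) %"

# System-wide context switches per second for each run.
echo "rate at 1 ms: $(per_sec 273271 47.272118708) /s"
echo "rate at 1 s:  $(per_sec 125643 47.148148344) /s"

# Elapsed time for the same two runs.
echo "elapsed time drop: $(pct_drop 47.272118708 47.148148344) %"
```

So the context switches drop by roughly half while the elapsed time changes by a fraction of a percent.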

I still wonder what effect this parameter has on performance.
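
One way to probe this further might be to sweep several values and let perf repeat each measurement, so that small differences can be told apart from noise. The following is a rough sketch, not a definitive procedure: the function name `sweep`, the repeat count, and the intermediate values are assumptions, while the debugfs path and the `$EVENTS` variable are taken from the commands above. It only prints the commands unless `RUN=1` is set, since writing to debugfs needs root.

```shell
#!/usr/bin/env bash
# Hypothetical sweep helper. Dry run by default: it only prints the commands.
# Set RUN=1 and run as root to actually execute them.

KNOB=/sys/kernel/debug/sched/min_granularity_ns   # path taken from the question
REPEATS=5                                         # assumed repeat count for perf stat -r

sweep() {
    local prog="$1"; shift
    local ns
    for ns in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            echo "$ns" > "$KNOB"
            # $prog is left unquoted on purpose so "./mm_omp 16" splits into words
            perf stat -a -r "$REPEATS" -e "$EVENTS" -- $prog
        else
            echo "echo $ns > $KNOB"
            echo "perf stat -a -r $REPEATS -e \$EVENTS -- $prog"
        fi
    done
}

# Low and high values from the question plus two assumed intermediate points.
sweep "./mm_omp 16" 1000000 10000000 100000000 1000000000
```

With `-r`, perf stat reports the mean and the variation across runs, which helps to judge whether a difference of a few tenths of a percent means anything.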
