Effect of min_granularity_ns on performance

Posted on 2025-01-20 05:30:58

To find the effect of the kernel parameter min_granularity_ns, I ran a 16-thread OpenMP (OMP) implementation of matrix multiplication with a low and a high value of that parameter. The perf results are shown below:

# echo 1000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_omp 16
Using 16 threads
Total execution Time in seconds: 12.3690895601
MM execution Time in seconds: 12.2312941169

 Performance counter stats for 'system wide':

            911.97 Joules power/energy-pkg/
   218,012,129,383        instructions              #    0.26  insn per cycle
   823,773,717,094        cycles
            37,701        context-switches
               131        cpu-migrations
            51,012        page-faults

      12.369310043 seconds time elapsed

# echo 1000000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_double_omp 16
Using 16 threads
Total execution Time in seconds: 12.3981724780
MM execution Time in seconds: 12.2612874920

 Performance counter stats for 'system wide':

            881.48 Joules power/energy-pkg/
   218,063,319,724        instructions              #    0.27  insn per cycle
   822,622,830,036        cycles
            37,959        context-switches
               146        cpu-migrations
            51,553        page-faults

      12.400958939 seconds time elapsed

As you can see, there is no difference between the results despite the large change in the kernel parameter, from 1 ms (1,000,000 ns) to 1 second (1,000,000,000 ns). Although there are other scheduler parameters besides min_granularity_ns, does this "no difference" make sense? Or is this perhaps not the right program to test with?
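To put a number on "no difference", here is a minimal Python sketch that converts the two parameter values to seconds (1,000,000 ns is 1 ms and 1,000,000,000 ns is 1 s) and compares the two OMP runs. The figures are copied from the perf output above. The helper names `ns_to_seconds` and `relative_change` are made up for this illustration and are not part of perf or the kernel.

```python
# Figures copied from the two OMP perf runs above (system-wide counters).
NS_PER_SECOND = 1_000_000_000


def ns_to_seconds(ns):
    """Convert a duration in nanoseconds to seconds."""
    return ns / NS_PER_SECOND


def relative_change(before, after):
    """Signed change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


omp_low = {
    "granularity_ns": 1_000_000,
    "elapsed_s": 12.369310043,
    "instructions": 218_012_129_383,
    "cycles": 823_773_717_094,
    "context_switches": 37_701,
}
omp_high = {
    "granularity_ns": 1_000_000_000,
    "elapsed_s": 12.400958939,
    "instructions": 218_063_319_724,
    "cycles": 822_622_830_036,
    "context_switches": 37_959,
}

for run in (omp_low, omp_high):
    seconds = ns_to_seconds(run["granularity_ns"])
    ipc = run["instructions"] / run["cycles"]  # instructions per cycle
    print(f"{run['granularity_ns']:>13} ns = {seconds:g} s, IPC = {ipc:.3f}")

elapsed_change = relative_change(omp_low["elapsed_s"], omp_high["elapsed_s"])
switch_change = relative_change(
    omp_low["context_switches"], omp_high["context_switches"]
)
print(f"elapsed time change:   {elapsed_change:+.2%}")
print(f"context-switch change: {switch_change:+.2%}")
```

With these inputs, both the elapsed time and the context-switch count move by well under 1%, so for this program the change in the parameter does not even show up in how often threads are switched.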


UPDATE 1: I tested another implementation that uses CBLAS and also runs 16 threads. As you can see, for a large matrix size (20k), the IPC is 1.77, which is acceptable. Again, varying min_granularity_ns makes no difference in time, although the number of context switches decreases with the larger granularity.

# echo 1000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_blas
Total execution Time in seconds: 47.0226452020
MM execution Time in seconds: 37.1756865050

 Performance counter stats for 'system wide':

          3,106.80 Joules power/energy-pkg/
 3,943,151,227,404        instructions              #    1.77  insn per cycle
 2,230,425,316,645        cycles
           273,271        context-switches
               383        cpu-migrations
         2,360,017        page-faults

      47.272118708 seconds time elapsed

# echo 1000000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_blas
Total execution Time in seconds: 46.8930790700
MM execution Time in seconds: 37.0639640210

 Performance counter stats for 'system wide':

          3,080.33 Joules power/energy-pkg/
 3,924,979,103,204        instructions              #    1.77  insn per cycle
 2,223,571,579,672        cycles
           125,643        context-switches
               355        cpu-migrations
         2,358,432        page-faults

      47.148148344 seconds time elapsed
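The same comparison for the two CBLAS runs can be sketched in Python as below, using only the figures printed above. Because `perf stat -a` counts system-wide, the context-switch totals cover every CPU and every process on the machine, not just mm_blas. The helper names are hypothetical and exist only for this example.

```python
# Figures copied from the two mm_blas perf runs above (system-wide counters).
blas_low = {"elapsed_s": 47.272118708, "context_switches": 273_271}
blas_high = {"elapsed_s": 47.148148344, "context_switches": 125_643}


def switches_per_second(run):
    """Average system-wide context-switch rate over the measured interval."""
    return run["context_switches"] / run["elapsed_s"]


def fractional_change(before, after):
    """Signed change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


for label, run in (("1,000,000 ns", blas_low), ("1,000,000,000 ns", blas_high)):
    rate = switches_per_second(run)
    print(f"{label:>16}: {rate:,.0f} context switches per second")

blas_switch_change = fractional_change(
    blas_low["context_switches"], blas_high["context_switches"]
)
blas_elapsed_change = fractional_change(
    blas_low["elapsed_s"], blas_high["elapsed_s"]
)
print(f"context-switch change: {blas_switch_change:+.1%}")
print(f"elapsed time change:   {blas_elapsed_change:+.2%}")
```

On these numbers the context-switch count falls by roughly half while the elapsed time changes by only a fraction of a percent. That matches the observation that fewer switches did not translate into a measurable speedup here.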

I still wonder what effect that parameter has on performance.
