Effect of min_granularity_ns on performance
To find the effect of the kernel parameter min_granularity_ns, a 16-thread OpenMP implementation of matrix multiplication was launched with a low and a high value of that parameter. The perf results are shown below:
# echo 1000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_omp 16
Using 16 threads
Total execution Time in seconds: 12.3690895601
MM execution Time in seconds: 12.2312941169
Performance counter stats for 'system wide':
911.97 Joules power/energy-pkg/
218,012,129,383 instructions # 0.26 insn per cycle
823,773,717,094 cycles
37,701 context-switches
131 cpu-migrations
51,012 page-faults
12.369310043 seconds time elapsed
# echo 1000000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_double_omp 16
Using 16 threads
Total execution Time in seconds: 12.3981724780
MM execution Time in seconds: 12.2612874920
Performance counter stats for 'system wide':
881.48 Joules power/energy-pkg/
218,063,319,724 instructions # 0.27 insn per cycle
822,622,830,036 cycles
37,959 context-switches
146 cpu-migrations
51,553 page-faults
12.400958939 seconds time elapsed
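The two runs above could be scripted as a small sweep so that both settings use exactly the same command line. This is only a sketch under assumptions: the question never shows what $EVENTS contains, so the list below is inferred from the counters in the perf output; `run`, `sweep`, and `DRY_RUN` are hypothetical names introduced for illustration, and the binary name is taken from the first run.

```shell
#!/bin/sh
# Sketch: write each min_granularity_ns value, then run the benchmark under perf stat.
# ASSUMPTION: EVENTS is inferred from the counters printed above; the question
# does not show its actual definition.
EVENTS=${EVENTS:-power/energy-pkg/,instructions,cycles,context-switches,cpu-migrations,page-faults}
# Knob path as used in the question (needs root and a mounted debugfs).
KNOB=/sys/kernel/debug/sched/min_granularity_ns
# Hypothetical switch: 1 = only print the commands, 0 = execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

sweep() {
    # "$@" is the benchmark command line, reused unchanged for both settings.
    for ns in 1000000 1000000000; do
        run sh -c "echo $ns > $KNOB"
        run perf stat -a -e "$EVENTS" -- "$@"
    done
}

sweep ./mm_omp 16
```

By default the script only prints the commands it would run; setting DRY_RUN=0 as root executes them.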
As you can see, there is no difference between the results despite the large change in the kernel parameter, from 1 ms (1,000,000 ns) to 1 second (1,000,000,000 ns). There are other scheduler parameters besides min_granularity_ns, but does this "no difference" make sense? Or is this perhaps not a suitable program for the test?
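As a unit check on the two values written to the knob, the nanosecond counts can be converted with plain shell arithmetic. The helper name `ns_to_human` is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Convert a nanosecond count to the largest whole unit (integer division).
ns_to_human() {
    ns=$1
    if [ "$ns" -ge 1000000000 ]; then
        echo "$((ns / 1000000000)) s"
    elif [ "$ns" -ge 1000000 ]; then
        echo "$((ns / 1000000)) ms"
    elif [ "$ns" -ge 1000 ]; then
        echo "$((ns / 1000)) us"
    else
        echo "$ns ns"
    fi
}

ns_to_human 1000000       # low setting used in the runs: prints "1 ms"
ns_to_human 1000000000    # high setting used in the runs: prints "1 s"
```

So the two settings differ by a factor of 1000, from one millisecond to one second.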
UPDATE 1: I tested another implementation that uses CBLAS and runs with 16 threads. As you can see, for a large matrix size (20k), the IPC is 1.77, which is acceptable. Again, varying min_granularity_ns makes no difference in time, although the number of context switches decreases with the larger granularity.
# echo 1000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_blas
Total execution Time in seconds: 47.0226452020
MM execution Time in seconds: 37.1756865050
Performance counter stats for 'system wide':
3,106.80 Joules power/energy-pkg/
3,943,151,227,404 instructions # 1.77 insn per cycle
2,230,425,316,645 cycles
273,271 context-switches
383 cpu-migrations
2,360,017 page-faults
47.272118708 seconds time elapsed
# echo 1000000000 > /sys/kernel/debug/sched/min_granularity_ns
# perf stat -a -e $EVENTS -- ./mm_blas
Total execution Time in seconds: 46.8930790700
MM execution Time in seconds: 37.0639640210
Performance counter stats for 'system wide':
3,080.33 Joules power/energy-pkg/
3,924,979,103,204 instructions # 1.77 insn per cycle
2,223,571,579,672 cycles
125,643 context-switches
355 cpu-migrations
2,358,432 page-faults
47.148148344 seconds time elapsed
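To put the drop in context switches for the two CBLAS runs on a common scale, the counts can be divided by the elapsed time. The numbers are copied from the perf output above; `cs_rate` is a hypothetical helper name. Because perf stat was run with -a, the rates are system-wide, not per thread.

```shell
#!/bin/sh
# Context switches per second = count / elapsed seconds.
cs_rate() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f\n", n / t }'
}

cs_rate 273271 47.272118708   # min_granularity_ns = 1000000
cs_rate 125643 47.148148344   # min_granularity_ns = 1000000000
```

This works out to roughly 5,800 versus 2,700 context switches per second across the whole system.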
I still wonder what effect that parameter has on performance.