Why does the same task cost differently on Linux kernel 4.9 and 5.4?
My application is a compute-intensive task (i.e. video encoding). When it runs on Linux kernel 4.9 (Ubuntu 16.04), the CPU usage is 3300%, but when it runs on Linux kernel 5.4 (Ubuntu 20.04), the CPU usage is only 2850%. I can promise that the processes do exactly the same job.
So I wonder whether the Linux kernel introduced some CPU scheduling optimization or related change between 4.9 and 5.4. Could you give any advice on how to investigate the reason?
I am not sure whether the glibc version has any effect; for your information, glibc is 2.23 on the Linux kernel 4.9 system and 2.31 on the Linux kernel 5.4 system.
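For reference, the CPU-usage and perf numbers below were taken from the already-running encoder process. A minimal sketch of the kind of commands involved, assuming sysstat and perf are installed (the pid, duration and detail level are illustrative, not the exact invocation used):

pidstat -p <pid> 5                        # per-process %CPU averaged over 5-second intervals
perf stat -d -d -d -p <pid> -- sleep 95   # attach counters to the pid for ~95 s; the repeated -d roughly matches the cache/TLB event list shown below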
CPU Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.000
BogoMIPS: 4401.69
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Output of perf stat on Linux Kernel 4.9
Performance counter stats for process id '32504':
3146297.833447 cpu-clock (msec) # 32.906 CPUs utilized
1,718,778 context-switches # 0.546 K/sec
574,717 cpu-migrations # 0.183 K/sec
2,796,706 page-faults # 0.889 K/sec
6,193,409,215,015 cycles # 1.968 GHz (30.76%)
6,948,575,328,419 instructions # 1.12 insn per cycle (38.47%)
540,538,530,660 branches # 171.801 M/sec (38.47%)
33,087,740,169 branch-misses # 6.12% of all branches (38.50%)
1,966,141,393,632 L1-dcache-loads # 624.906 M/sec (38.49%)
184,477,765,497 L1-dcache-load-misses # 9.38% of all L1-dcache hits (38.47%)
8,324,742,443 LLC-loads # 2.646 M/sec (30.78%)
3,835,471,095 LLC-load-misses # 92.15% of all LL-cache hits (30.76%)
<not supported> L1-icache-loads
187,604,831,388 L1-icache-load-misses (30.78%)
1,965,198,121,190 dTLB-loads # 624.607 M/sec (30.81%)
438,496,889 dTLB-load-misses # 0.02% of all dTLB cache hits (30.79%)
7,139,892,384 iTLB-loads # 2.269 M/sec (30.79%)
260,660,265 iTLB-load-misses # 3.65% of all iTLB cache hits (30.77%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
95.615072142 seconds time elapsed
Output of perf stat on Linux Kernel 5.4
Performance counter stats for process id '3355137':
2,718,192.32 msec cpu-clock # 29.184 CPUs utilized
1,719,910 context-switches # 0.633 K/sec
448,685 cpu-migrations # 0.165 K/sec
3,884,586 page-faults # 0.001 M/sec
5,927,930,305,757 cycles # 2.181 GHz (30.77%)
6,848,723,995,972 instructions # 1.16 insn per cycle (38.47%)
536,856,379,853 branches # 197.505 M/sec (38.47%)
32,245,288,271 branch-misses # 6.01% of all branches (38.48%)
1,935,640,517,821 L1-dcache-loads # 712.106 M/sec (38.47%)
177,978,528,204 L1-dcache-load-misses # 9.19% of all L1-dcache hits (38.49%)
8,119,842,688 LLC-loads # 2.987 M/sec (30.77%)
3,625,986,107 LLC-load-misses # 44.66% of all LL-cache hits (30.75%)
<not supported> L1-icache-loads
184,001,558,310 L1-icache-load-misses (30.76%)
1,934,701,161,746 dTLB-loads # 711.760 M/sec (30.74%)
676,618,636 dTLB-load-misses # 0.03% of all dTLB cache hits (30.76%)
6,275,901,454 iTLB-loads # 2.309 M/sec (30.78%)
391,706,425 iTLB-load-misses # 6.24% of all iTLB cache hits (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
93.139551411 seconds time elapsed
UPDATE:
- It is confirmed that the performance gain comes from Linux kernel 5.4 itself, because the performance on Linux kernel 5.3 is the same as on Linux kernel 4.9 (the exact commit could be pinned down with a kernel bisect; see the sketch below).
- It is confirmed that the performance gain has no relation to libc, because a system running Linux kernel 5.10 with glibc 2.23 performs the same as the Linux kernel 5.4 system with glibc 2.31.
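A minimal sketch of how such a bisect could be run between v5.3 and v5.4, assuming the workload can be re-run after each kernel build and reboot (commands are illustrative, not the exact steps taken):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git && cd linux
git bisect start v5.4 v5.3      # treat v5.4 ("new behaviour") as bad, v5.3 as good
# ...build, install and boot the commit git suggests, re-run the encoder, then:
git bisect bad                  # if CPU usage dropped to ~2850% (new behaviour)
git bisect good                 # if CPU usage stayed at ~3300% (old behaviour)
git bisect reset                # once the first "bad" commit is reported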
ANSWER: It seems the performance gain comes from this fix:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de53fd7aedb100f03e5d2231cfce0e4993282425
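The linked commit appears to change CFS bandwidth control (the expiration of cpu-local runtime slices used for cpu.cfs_quota_us throttling), so it should only make a difference if the encoder runs under a CPU quota, for example inside a container. A quick way to check, assuming cgroup v1 paths (<pid> and <cgroup> are placeholders; under cgroup v2 the equivalents are cpu.max and cpu.stat):

cat /proc/<pid>/cgroup                            # find which cpu cgroup the encoder belongs to
cat /sys/fs/cgroup/cpu/<cgroup>/cpu.cfs_quota_us  # -1 means no quota, so this commit is unlikely to be the cause
cat /sys/fs/cgroup/cpu/<cgroup>/cpu.stat          # compare nr_throttled / throttled_time between 4.9 and 5.4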