How to monitor the NUMA interconnect (QPI/UPI) bandwidth usage of a process in Linux?

How can I measure how much QPI/UPI bandwidth a process is using between two NUMA nodes in Linux?

Let's say my process has a thread on NUMA node 0 and another thread on NUMA node 1, each accessing its data on the other NUMA node and thereby stressing the QPI/UPI bandwidth. How can I measure this bandwidth usage?

I have a machine with 2x Intel Skylake processors, which use UPI technology, but I think the solution would be the same for QPI as well (not sure!).

Answered by 携君以终年 on 2025-01-31 02:10:41

You want to measure the traffic (bandwidth) generated by memory accesses between two Non-Uniform Memory Access (NUMA) nodes (aka 'remote memory accesses' or 'NUMA accesses'). When a processor needs to access data stored in memory managed by a different processor, a point-to-point processor interconnect such as the Intel Ultra Path Interconnect (UPI) is used.
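
To see this in practice, here is a small C workload (my own illustrative sketch, not part of the original answer) that pins the process to NUMA node 0 with libnuma and repeatedly sweeps a buffer allocated on node 1, so essentially all of its memory traffic has to cross the interconnect. It assumes a two-node machine and the libnuma development headers; build with something like gcc -O2 remote_stream.c -o remote_stream -lnuma:

#include <numa.h>      /* libnuma: numa_available, numa_run_on_node, numa_alloc_onnode */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Run this (single-threaded) process on the CPUs of node 0 ... */
    if (numa_run_on_node(0) != 0) {
        perror("numa_run_on_node");
        return 1;
    }

    /* ... but place the buffer on node 1, so every access is remote. */
    size_t len = 1UL << 30;                 /* 1 GiB (placeholder size) */
    char *buf = numa_alloc_onnode(len, 1);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* Sweep the buffer repeatedly; watch the UPI/RMB counters meanwhile. */
    for (int iter = 0; iter < 100; iter++)
        memset(buf, iter, len);

    numa_free(buf, len);
    return 0;
}

Running something like this while PCM is monitoring should make the UPI utilization (and the RMB column for the chosen cores) rise accordingly.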

Collecting the UPI (or QPI) bandwidth for a specific process/thread can get tricky.

Per UPI link bandwidth (CPU socket granularity)

Processor Counter Monitor (PCM) provides a number of command-line utilities for real-time monitoring. For instance, the pcm binary displays a per-socket UPI traffic estimation. Depending on the required precision (and the NUMA traffic generated by other processes), this might be enough to tell whether the UPI links are saturated.
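
For example, passing an update interval in seconds, such as sudo pcm 1.0, refreshes the counters every second (the exact binary name, options and required privileges depend on the PCM version and how it was installed, so treat this invocation as an assumption).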

Intel Memory Latency Checker (MLC) can be used as a workload to check how PCM behaves when creating maximum traffic between two NUMA nodes.

For instance, using the workload generated by ./mlc --bandwidth_matrix -t15 (during a remote access phase), PCM displays the following on my 2-socket (Intel Cascade Lake) server:

Intel(r) UPI data traffic estimation in bytes (data traffic coming to CPU/socket through UPI links):

               UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2  
---------------------------------------------------------------------------------------------------------------
 SKT    0       17 G     17 G      0     |   73%    73%     0%  
 SKT    1     6978 K   7184 K      0     |    0%     0%     0%  
---------------------------------------------------------------------------------------------------------------
Total UPI incoming data traffic:   34 G     UPI data traffic/Memory controller traffic: 0.96

Intel(r) UPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through UPI links):

               UPI0     UPI1     UPI2    |  UPI0   UPI1   UPI2  
---------------------------------------------------------------------------------------------------------------
 SKT    0     8475 M   8471 M      0     |   35%    35%     0%  
 SKT    1       21 G     21 G      0     |   91%    91%     0%  
---------------------------------------------------------------------------------------------------------------
Total UPI outgoing data and non-data traffic:   59 G
MEM (GB)->|  READ |  WRITE | LOCAL | PMM RD | PMM WR | CPU energy | DIMM energy | LLCRDMISSLAT (ns) UncFREQ (Ghz)
---------------------------------------------------------------------------------------------------------------
 SKT   0     0.19     0.05   92 %      0.00      0.00      87.58      13.28         582.98 2.38
 SKT   1    36.16     0.01    0 %      0.00      0.00      66.82      21.86         9698.13 2.40
---------------------------------------------------------------------------------------------------------------
       *    36.35     0.06    0 %      0.00      0.00     154.40      35.14         585.67 2.39

Monitoring the NUMA traffic (CPU core granularity)

PCM also displays the per-core remote traffic in MB/s (i.e. NUMA traffic). See the RMB column:

RMB : L3 cache external bandwidth satisfied by remote memory (in MBytes)

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI |   L3OCC |   LMB  |   RMB  | TEMP

   0    0     0.04   0.04   1.00    1.00    1720 K   1787 K    0.04    0.55  0.0167  0.0173      800        1      777     49
   1    0     0.04   0.04   1.00    1.00    1750 K   1816 K    0.04    0.55  0.0171  0.0177      640        5      776     50
   2    0     0.04   0.04   1.00    1.00    1739 K   1828 K    0.05    0.55  0.0169  0.0178      720        0      777     50
   3    0     0.04   0.04   1.00    1.00    1721 K   1800 K    0.04    0.55  0.0168  0.0175      240        0      784     51
<snip>
---------------------------------------------------------------------------------------------------------------
 SKT    0     0.04   0.04   1.00    1.00      68 M     71 M    0.04    0.55  0.0168  0.0175    26800        8    31632     48
 SKT    1     0.02   0.88   0.03    1.00      66 K   1106 K    0.94    0.13  0.0000  0.0005    25920        4       15     52
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.03   0.06   0.51    1.00      68 M     72 M    0.05    0.54  0.0107  0.0113     N/A     N/A     N/A      N/A

The per-core remote traffic can be used to gather thread-level NUMA traffic.

Method to estimate NUMA throughput generated between threads

  1. You need to ensure that the threads generating the NUMA traffic are bound to dedicated cores. That can be done programmatically (see the sketch after this list), or you can rebind the threads with a tool such as hwloc-bind.

  2. Ensure that other processes are bound to different CPU cores (scripts like cpusanitizer might be useful to periodically scan all processes and modify their CPU core affinity). Note: pay attention to hyperthreading; you don't want the threads you need to monitor to share physical CPU cores with other processes.

  3. Check the remote traffic (PCM RMB column) generated on the cores to which you pinned the threads you want to monitor.
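
As an illustration of the programmatic binding mentioned in step 1 (my own sketch, not from the original answer), the following uses the GNU-specific pthread_setaffinity_np to pin each worker thread to a dedicated core. The core IDs are placeholders and must be chosen to match your machine's topology (e.g. as reported by lscpu or hwloc-ls); build with gcc -O2 pin_threads.c -o pin_threads -pthread:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Bind the calling thread to exactly one CPU core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    int core = *(int *)arg;
    if (pin_to_core(core) != 0)
        fprintf(stderr, "failed to pin thread to core %d\n", core);
    /* ... the thread's real work (its remote memory accesses) goes here ... */
    return NULL;
}

int main(void)
{
    /* Placeholder core IDs: pick one core on socket 0 and one on socket 1. */
    int cores[2] = {0, 1};
    pthread_t threads[2];

    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Once the threads are pinned this way (or rebound externally with hwloc-bind), the RMB values that pcm reports for those cores approximate the NUMA traffic generated by each thread.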
