Linux: find out the hyper-threaded core ID

Published on 2024-12-02 16:09:19

I spent this morning trying to find out how to determine which processor ID belongs to a hyper-threaded core, but without luck.

I wish to find this information and use sched_setaffinity() to bind a process to a hyper-threaded or non-hyper-threaded logical CPU to profile its performance.

Comments (6)

你穿错了嫁妆 2024-12-09 16:09:19

I discovered a simple trick to do what I need.

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

If the first number is equal to the CPU number (0 in this example), then it's a real core; if not, it is a hyper-threading core.

Real core example:

# cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,13

Hyper-threading core example:

# cat /sys/devices/system/cpu/cpu13/topology/thread_siblings_list
1,13

The output of the second example is exactly the same as the first one. However, we are checking cpu13, and the first number is 1, so CPU 13 is a hyper-threading core.
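
The check above can be wrapped in a small bash helper. This is only a sketch under the sysfs layout shown above; `classify_siblings` is a hypothetical name, and it takes the siblings string as an argument so the logic can be tested without /sys:

```shell
# classify_siblings CPU SIBLINGS
# Prints "primary" when CPU is the first entry of its thread_siblings_list
# (a "real" core in the answer's terms), "sibling" otherwise.
classify_siblings() {
  local cpu="$1" siblings="$2"
  # thread_siblings_list uses commas ("1,13") or ranges ("0-1");
  # the first field is the lowest-numbered sibling.
  local first="${siblings%%[,-]*}"
  if [ "$first" = "$cpu" ]; then
    echo primary
  else
    echo sibling
  fi
}

# Real use, with the sysfs path from above:
# classify_siblings 13 "$(cat /sys/devices/system/cpu/cpu13/topology/thread_siblings_list)"
```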

春夜浅 2024-12-09 16:09:19

I'm surprised nobody has mentioned lscpu yet. Here's an example on a single-socket system with four physical cores and hyper-threading enabled:

$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,0,0,0,,0,0,0,0
5,1,0,0,,1,1,1,0
6,2,0,0,,2,2,2,0
7,3,0,0,,3,3,3,0

The output explains how to interpret the table of IDs; logical CPU IDs with the same Core ID are siblings.

恋你朝朝暮暮 2024-12-09 16:09:19

HT is symmetric (in terms of basic resources; the system mode may be asymmetric).

So, if HT is turned on, most of the physical core's resources are shared between the two threads, and some additional hardware is enabled to hold the state of both threads. Both threads have symmetric access to the physical core.

There is a difference between an HT-disabled core and an HT-enabled core, but there is no difference between the 1st half of an HT-enabled core and the 2nd half.

At any given moment one HT thread may use more resources than the other, but this resource balancing is dynamic. The CPU balances the threads as best it can when both want the same resource. The most you can do is execute rep nop / pause in one thread to let the CPU give more resources to the other.

I wish to find out this information and use set_affinity() to bind a process to hyper-threaded thread or non-hyper-threaded thread to profile its performance.

Okay, you can actually measure performance without knowing this fact. Just profile while the only busy thread in the system is bound to CPU0; then repeat with it bound to CPU1. I think the results will be almost the same (the OS can generate noise if it binds some interrupts to CPU0, so try to lower the number of interrupts while testing, and try CPU2 and CPU3 if you have them).

PS

Agner (he is the guru of x86) recommends using the even-numbered logical processors when you do not want to use HT but it is enabled in the BIOS:

If hyperthreading is detected then lock the process to use the even-numbered logical processors only. This will make one of the two threads in each processor core idle so that there is no contention for resources.
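
Agner's advice can be turned into a one-liner for taskset. A sketch (`even_cpus` is a hypothetical helper; it assumes the even/odd sibling numbering that the advice presumes, which, as other answers on this page note, is not guaranteed on every system):

```shell
# even_cpus N: print a comma-separated list of the even-numbered logical
# CPUs among 0..N-1, in a form suitable for `taskset -c`.
even_cpus() {
  seq 0 2 "$(( $1 - 1 ))" | paste -s -d ',' -
}

# Example: pin a program to the even-numbered CPUs only, e.g.
#   taskset -c "$(even_cpus "$(nproc)")" ./my_program
```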

PPS About the newer incarnation of HT (not the P4 one, but Nehalem and Sandy Bridge), based on Agner's microarchitecture research:

The new bottlenecks that require attention in the Sandy Bridge are the following:
...
5. Sharing of resources between threads. Many of the critical resources are shared
between the two threads of a core when hyperthreading is on. It may be wise to turn
off hyperthreading when multiple threads depend on the same execution resources.

...

A half-way solution was introduced in the NetBurst and again in the Nehalem and Sandy Bridge with the so-called
hyperthreading technology. The hyperthreading processor has two logical processors
sharing the same execution core. The advantage of this is limited if the two threads
compete for the same resources, but hyperthreading can be quite advantageous if the
performance is limited by something else, such as memory access.

...

Both Intel and AMD are making hybrid solutions where some or all of the
execution units are shared between two processor cores (hyperthreading in Intel
terminology).

PPPS: The Intel optimization manual lists the resource sharing in second-generation HT (page 93; the list is for Nehalem, but it is unchanged in the Sandy Bridge section):

Deeper buffering and enhanced resource sharing/partition policies:

  • Replicated resources for HT operation: register state, renamed return stack
    buffer, large-page ITLB. // comment by me: there are 2 sets of this hardware
  • Partitioned resources for HT operation: load buffers, store buffers, re-order
    buffers, small-page ITLB are statically allocated between the two logical
    processors. // comment by me: there is a single set of this hardware; it is statically split in half between the two HT virtual cores
  • Competitively-shared resources during HT operation: the reservation station,
    cache hierarchy, fill buffers, both DTLB0 and STLB. // comment: a single set, not divided in half; the CPU dynamically redistributes the resources
  • Alternating during HT operation: front-end operation generally alternates
    between the two logical processors to ensure fairness. // comment: there is a single front end (instruction decoder), so threads are decoded in order: 1, 2, 1, 2
  • HT-unaware resources: execution units. // comment: these are the actual hardware units that do computations and memory accesses; there is only a single set. If one thread can use many execution units and has few memory waits, it will consume all of them, and the second thread's performance will be low (though HT will sometimes switch to the second thread; how often is unclear). If both threads are not heavily optimized and/or have memory waits, the execution units are split between the two threads.

There is also a figure on page 112 (Figure 2-13) which shows that the two logical cores are symmetric.

The performance potential due to HT Technology is due to:

  • The fact that operating systems and user programs can schedule processes or
    threads to execute simultaneously on the logical processors in each physical
    processor
  • The ability to use on-chip execution resources at a higher level than when only a
    single thread is consuming the execution resources; a higher level of resource
    utilization can lead to higher system throughput

Although instructions originating from two programs or two threads execute simultaneously
and not necessarily in program order in the execution core and memory hierarchy,
the front end and back end contain several selection points to select between
instructions from the two logical processors. All selection points alternate between
the two logical processors unless one logical processor cannot make use of a pipeline
stage. In this case, the other logical processor has full use of every cycle of the pipeline
stage. Reasons why a logical processor may not use a pipeline stage include
cache misses, branch mispredictions, and instruction dependencies.

虚拟世界 2024-12-09 16:09:19

There is a universal (Linux/Windows) and portable hardware topology detector (cores, HT, caches, south bridges, and disk/net connection locality): hwloc, from the OpenMPI project. You may want to use it, because Linux may use different HT core numbering rules, and we can't know in advance whether it will be an even/odd or a y and y+8 numbering rule.

Home page of hwloc:
http://www.open-mpi.org/projects/hwloc/

Download page:
http://www.open-mpi.org/software/hwloc/v1.10/

Description:

The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs. It primarily aims at helping applications with gathering information about modern computing hardware so as to exploit it accordingly and efficiently.

It has lstopo command to get hw topology in graphic form like

 ubuntu$ sudo apt-get install hwloc
 ubuntu$ lstopo

lstopo from hwloc (OpenMPI) - output example

or in text form:

 ubuntu$ sudo apt-get install hwloc-nox
 ubuntu$ lstopo --of console

We can see physical cores as Core L#x each having two logical cores PU L#y and PU L#y+8.

Machine (16GB)
  Socket L#0 + L3 L#0 (4096KB)
    L2 L#0 (1024KB) + L1 L#0 (16KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#8)
    L2 L#1 (1024KB) + L1 L#1 (16KB) + Core L#1
      PU L#2 (P#4)
      PU L#3 (P#12)
  Socket L#1 + L3 L#1 (4096KB)
    L2 L#2 (1024KB) + L1 L#2 (16KB) + Core L#2
      PU L#4 (P#1)
      PU L#5 (P#9)
    L2 L#3 (1024KB) + L1 L#3 (16KB) + Core L#3
      PU L#6 (P#5)
      PU L#7 (P#13)
  Socket L#2 + L3 L#2 (4096KB)
    L2 L#4 (1024KB) + L1 L#4 (16KB) + Core L#4
      PU L#8 (P#2)
      PU L#9 (P#10)
    L2 L#5 (1024KB) + L1 L#5 (16KB) + Core L#5
      PU L#10 (P#6)
      PU L#11 (P#14)
  Socket L#3 + L3 L#3 (4096KB)
    L2 L#6 (1024KB) + L1 L#6 (16KB) + Core L#6
      PU L#12 (P#3)
      PU L#13 (P#11)
    L2 L#7 (1024KB) + L1 L#7 (16KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#15)
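
The text-form output can also be post-processed. A sketch that extracts the per-core physical PU numbers from `lstopo --of console` output (`pu_by_core` is a hypothetical helper; it assumes the line layout shown above):

```shell
# pu_by_core: read `lstopo --of console` output on stdin and print, for each
# Core L#x, the physical (P#) ids of its PUs.
pu_by_core() {
  awk '/Core L#/ { core = $NF }
       /PU L#/   { match($0, /P#[0-9]+/)
                   print core, substr($0, RSTART + 2, RLENGTH - 2) }'
}

# Real use: lstopo --of console | pu_by_core
# On the sample above, Core L#0 would yield physical PUs 0 and 8.
```
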
夜深人未静 2024-12-09 16:09:19

A simple way to get the hyper-threading siblings of CPU cores in bash:

cat $(find /sys/devices/system/cpu -regex ".*cpu[0-9]+/topology/thread_siblings_list") | sort -n | uniq

There's also lscpu -e which will give relevant core and cpu info:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    4100.0000 400.0000
1   0    0      1    1:1:1:0       yes    4100.0000 400.0000
2   0    0      2    2:2:2:0       yes    4100.0000 400.0000
3   0    0      3    3:3:3:0       yes    4100.0000 400.0000
4   0    0      0    0:0:0:0       yes    4100.0000 400.0000
5   0    0      1    1:1:1:0       yes    4100.0000 400.0000
6   0    0      2    2:2:2:0       yes    4100.0000 400.0000
7   0    0      3    3:3:3:0       yes    4100.0000 400.0000
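
The repeated CORE values in the `lscpu -e` output make the HT siblings easy to pick out mechanically. A sketch (`ht_cpus` is a hypothetical helper; it assumes CORE is the 4th column, as in the table above):

```shell
# ht_cpus: read `lscpu -e` output on stdin and print the logical CPUs that
# are hyper-thread siblings (any CPU whose CORE id was already seen).
ht_cpus() {
  awk 'NR > 1 { if ($4 in seen) print $1; else seen[$4] = 1 }'
}

# Demo with a few rows of the sample above; real use: lscpu -e | ht_cpus
ht_cpus <<'EOF'
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    4100.0000 400.0000
1   0    0      1    1:1:1:0       yes    4100.0000 400.0000
4   0    0      0    0:0:0:0       yes    4100.0000 400.0000
5   0    0      1    1:1:1:0       yes    4100.0000 400.0000
EOF
# prints: 4 then 5
```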
擦肩而过的背影 2024-12-09 16:09:19

I tried to verify the information by comparing the temperature of the core and load on the HT core.

(image: graphs comparing core temperature with load on the HT core)
