Explain why the effective DRAM bandwidth is reduced when adding CPUs

Posted on 2025-01-28 13:04:12


This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system

I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:

  1. 24 cores @ 2.70 GHz,
  2. L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.

As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidth of each available CPU data path. According to this, the bandwidth is 230 GB/s.

Strong scaling of bandwidth:

Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
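
For reference, a minimal sketch of the kind of strong-scaling bandwidth kernel described above (assuming OpenMP, a read-only sum and a 2 GiB working set; this is illustrative, not the actual benchmark code):

```c
/*
 * Minimal sketch of a strong-scaling read-bandwidth micro-benchmark.
 * Assumptions: OpenMP, a working set far larger than the 33 MB L3,
 * and a simple streaming read kernel. Not the original benchmark.
 *
 * Build (hypothetical): gcc -O3 -fopenmp -o bw bw.c
 * Run:                  OMP_NUM_THREADS=<cores> ./bw
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1ull << 28)          /* 2^28 doubles = 2 GiB, well beyond the LLC */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* First-touch initialization in parallel so pages end up on the
       NUMA nodes of the threads that will later read them. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        a[i] = 1.0;

    double sum = 0.0;
    double t0 = omp_get_wtime();

    /* Streaming read kernel: each thread scans its contiguous chunk. */
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (size_t i = 0; i < N; i++)
        sum += a[i];

    double t1 = omp_get_wtime();
    double gib = (double)N * sizeof *a / (1024.0 * 1024.0 * 1024.0);

    printf("threads=%d  bandwidth=%.1f GiB/s  (checksum %.1f)\n",
           omp_get_max_threads(), gib / (t1 - t0), sum);
    free(a);
    return 0;
}
```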


Comments (1)

楠木可依 2025-02-04 13:04:13


Overview

This answer provides probable explanations. Put shortly, no parallel workload scales indefinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where enough cores saturate a given shared resource, and using more cores only increases the overhead.

More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and non-temporal prefetching should improve the performance and scalability of your benchmark somewhat. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors handle memory accesses and how to find the bottlenecks.


Under the hood

In your case, the layout of the Intel Xeon Skylake SP processors is as follows:

[figure: processor configuration]

[figure: core configuration]
Source: Intel

There are two sockets connected by a UPI interconnect, and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controllers (IMCs) per processor, and each IMC is connected to 3 DDR4 DRAM channels @ 2666 MHz. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
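
For what it's worth, the arithmetic can be spelled out explicitly (purely illustrative; all values are taken from the paragraph above):

```c
/*
 * Back-of-the-envelope check of the theoretical peak bandwidth quoted above.
 */
#include <stdio.h>

int main(void)
{
    double sockets            = 2;        /* 2x Xeon Platinum 8168            */
    double imcs_per_socket    = 2;        /* integrated memory controllers     */
    double channels_per_imc   = 3;        /* DDR4 channels per IMC             */
    double transfers_per_sec  = 2666e6;   /* DDR4-2666: 2666 MT/s per channel  */
    double bytes_per_transfer = 8;        /* 64-bit wide channel               */

    double gb  = sockets * imcs_per_socket * channels_per_imc
               * transfers_per_sec * bytes_per_transfer / 1e9;
    double gib = gb * 1e9 / (1024.0 * 1024.0 * 1024.0);

    printf("theoretical peak: %.0f GB/s = %.0f GiB/s\n", gb, gib);
    /* prints: theoretical peak: 256 GB/s = 238 GiB/s */
    return 0;
}
```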

Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I would expect a very low UPI throughput and a very small number of remote NUMA pages. You can check this with hardware counters; Linux perf or VTune lets you do so relatively easily.
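
If you want to enforce that property explicitly, here is a hedged sketch of NUMA-local placement with libnuma (the 1 GiB size and the loop structure are illustrative assumptions, not the original benchmark's code):

```c
/*
 * Hedged sketch of NUMA-local placement with libnuma (link with -lnuma).
 * numa_alloc_onnode() binds the pages of the buffer to the requested node and
 * numa_run_on_node() keeps the calling thread on that node, so all accesses
 * stay on the local IMCs.
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t bytes = 1ull << 30;                    /* 1 GiB per NUMA node     */
    int    nodes = numa_num_configured_nodes();   /* 2 here, 4 with SNC on   */

    for (int node = 0; node < nodes; node++) {
        numa_run_on_node(node);                   /* stay on this node's cores */
        double *buf = numa_alloc_onnode(bytes, node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }

        for (size_t i = 0; i < bytes / sizeof *buf; i++)
            buf[i] = 0.0;                         /* make the pages resident */

        printf("node %d: %zu MiB allocated and touched locally\n",
               node, bytes >> 20);
        numa_free(buf, bytes);
    }
    return 0;
}
```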

The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This method lets the processor balance the throughput between all the L3 slices. It also lets the processor balance the throughput between the two IMCs, so that in the end the processor looks like an SMP architecture rather than a NUMA one. The same approach was also used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).

Hashing does not guarantee a perfect balance though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. Poor balancing decreases the memory throughput due to partial stalls. This is one reason why you cannot reach the theoretical bandwidth.

With a good hash function, the balance should be independent of the number of cores used. If the hash function is not good enough, one IMC can be more saturated than the other, with the imbalance oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for the per-IMC throughput, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you should have 4 such counters (one per IMC).

Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split each processor into two NUMA nodes that have their own dedicated IMC. This sidesteps the balancing issue caused by the hash function and therefore results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node, leaving only half the bandwidth usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.

[figure: Sub-NUMA Clustering]
Source: Intel
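
As a quick, hedged illustration (assuming libnuma is installed; this helper is not part of the original benchmark), the following prints the NUMA node of every CPU, which makes the extra nodes visible once SNC is enabled and helps pin threads to the half-socket that owns their memory:

```c
/*
 * Illustrative check of the NUMA topology seen by the OS.
 * With SNC enabled, each socket appears as two NUMA nodes.
 * Link with -lnuma.
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int nodes = numa_num_configured_nodes();   /* expect 4 with SNC, 2 without */
    int cpus  = numa_num_configured_cpus();

    printf("%d NUMA nodes, %d CPUs\n", nodes, cpus);
    for (int cpu = 0; cpu < cpus; cpu++)
        printf("cpu %3d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```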

Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when a core actually needs them (with an additional DRAM latency to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, the hardware prefetching units have to prefetch data a long time in advance to reduce the impact of that latency, and they also need to keep many requests in flight concurrently. This is generally not a problem with sequential accesses, but more cores cause the accesses to look more random from the caches' and IMCs' point of view. The thing is, DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my 6-core Skylake-based PC, but it is not strong enough to cause any visible impact on the final throughput). To mitigate this problem, you can use software non-temporal prefetching (prefetchnta) to request that the processor load the data directly into the line fill buffers instead of the L3 cache, resulting in less pollution (here is a related answer). This may be slower with fewer cores due to the lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of the fetched addresses looking more random from the IMCs' point of view, and there is not much that can be done about that.
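
A hedged sketch of such a non-temporal software prefetch, using the _mm_prefetch intrinsic with the _MM_HINT_NTA hint (the read kernel and the prefetch distance are illustrative assumptions, not tuned values):

```c
/*
 * Illustrative non-temporal software prefetching on a streaming read.
 * PF_DIST is an assumed prefetch distance; in practice it must be tuned.
 */
#include <immintrin.h>
#include <stddef.h>

double sum_with_nta_prefetch(const double *a, size_t n)
{
    const size_t PF_DIST = 64;   /* prefetch ~8 cache lines (64 doubles) ahead */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* One prefetch per 64-byte cache line (8 doubles). The NTA hint asks
           the processor to bring the line in with minimal cache pollution. */
        if ((i % 8) == 0 && i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_NTA);
        sum += a[i];
    }
    return sum;
}
```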

The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
