Are modern CPU caches optimized to deal with constant strides? Across threads?

Posted 2024-08-10 03:20:37

Say I have a big array, and multiple threads reading from the array. Each thread iterates through the array by jumping a constant amount, but starts at a different offset. So thread 1 may start at element 0, then read elements 32, 64, 96, etc., but thread 2 starts at element 1 and reads elements 33, 65, 97, etc. (keeping in mind that an 'element' may constitute more than a single byte or word). I know that usually spatial locality is desirable for getting the best cache performance, but I've also read that modern CPUs have hardware prefetchers that look for patterns in accesses, and a constant stride seems to me like an obvious pattern.

  • So is this cache-friendly on a modern box, or isn't it?
  • What if I increase the stride to something larger than a cache line?
  • Is the answer affected by the use of multiple threads (so despite accessing the same memory they may be running on different cores with different caches)?
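
To make the access pattern concrete, here is a minimal C++ sketch of what is being described; the array size, element type, stride of 32, and thread count are made-up values for illustration, not anything fixed by the question:

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical sizes, just to make the pattern concrete.
    constexpr std::size_t N      = 1 << 20;  // number of elements
    constexpr std::size_t STRIDE = 32;       // elements jumped per step
    constexpr int NUM_THREADS    = 4;

    std::vector<std::uint64_t> data(N);

    // Thread t starts at element t and jumps by STRIDE:
    // thread 0 touches 0, 32, 64, ...; thread 1 touches 1, 33, 65, ...
    void reader(std::size_t offset, std::uint64_t& sink) {
        std::uint64_t sum = 0;
        for (std::size_t i = offset; i < N; i += STRIDE)
            sum += data[i];
        sink = sum;  // keep the reads from being optimized away
    }

    int main() {
        std::vector<std::uint64_t> sinks(NUM_THREADS);
        std::vector<std::thread> threads;
        for (int t = 0; t < NUM_THREADS; ++t)
            threads.emplace_back(reader, static_cast<std::size_t>(t), std::ref(sinks[t]));
        for (auto& th : threads) th.join();
    }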

Comments (1)

幸福%小乖 2024-08-17 03:20:37

Cache performance is pretty complex, and the really reliable answers are going to come from hardware designers or operating system developers who work specifically with scheduling and dispatching. I used to work on performance analysis tools on large IBM systems, so I can give a partial, slightly out-of-date answer:

First, cache memory is associative by address. If a piece of memory is addressed, the "cache line" for that address is loaded into cache. Depending on processor design, this could be 4, 8, 16, or 32 bytes in length. (Maybe more.) This will most likely be based on the "alignment" of hardware addresses; in other words, a 32-byte line will sit on a boundary whose address is divisible by 32. Your memory reference may be at the beginning, middle, or end of that cache line.

Once it's in the cache, the address is used as a "lookup" to find the cached data.
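
As a rough illustration of that lookup, an address is typically split into a line offset, a set index, and a tag; the sketch below assumes a hypothetical 32 KiB, 8-way set-associative cache with 64-byte lines, which is not something the answer above specifies:

    #include <cstdint>
    #include <cstdio>

    // Assumed geometry: 32 KiB, 8-way set-associative, 64-byte lines.
    constexpr std::uint64_t LINE_SIZE  = 64;
    constexpr std::uint64_t NUM_WAYS   = 8;
    constexpr std::uint64_t CACHE_SIZE = 32 * 1024;
    constexpr std::uint64_t NUM_SETS   = CACHE_SIZE / (LINE_SIZE * NUM_WAYS);

    int main() {
        std::uint64_t addr = 0x7ffd12345678;  // some hypothetical address

        std::uint64_t line_base = addr & ~(LINE_SIZE - 1);        // aligned start of the cache line
        std::uint64_t offset    = addr & (LINE_SIZE - 1);         // byte position within the line
        std::uint64_t set       = (addr / LINE_SIZE) % NUM_SETS;  // which set the line maps to
        std::uint64_t tag       = addr / (LINE_SIZE * NUM_SETS);  // compared against tags stored in that set

        std::printf("line base 0x%llx, offset %llu, set %llu, tag 0x%llx\n",
                    (unsigned long long)line_base, (unsigned long long)offset,
                    (unsigned long long)set, (unsigned long long)tag);
    }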

Locality of reference will help you if the cache line is large enough that an "adjacent" item you reference happens to have been brought in as part of the same cache line. Jumping through your array will defeat this.
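
One way to see this effect in isolation is to time a sequential pass against a strided pass over the same buffer; the buffer size and stride below are arbitrary assumptions, and the actual numbers will vary from machine to machine:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        constexpr std::size_t N = 1 << 24;  // assumed buffer size (elements)
        constexpr std::size_t STRIDE = 16;  // assumed stride (elements)
        std::vector<std::uint64_t> buf(N, 1);

        // Touch every element exactly once, either sequentially or in strided passes,
        // so any timing difference comes from locality rather than total work.
        auto time_pass = [&](std::size_t stride) {
            auto start = std::chrono::steady_clock::now();
            std::uint64_t sum = 0;
            for (std::size_t s = 0; s < stride; ++s)
                for (std::size_t i = s; i < N; i += stride)
                    sum += buf[i];
            auto end = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::printf("stride %zu: sum=%llu, %lld ms\n",
                        stride, (unsigned long long)sum, (long long)ms);
        };

        time_pass(1);       // sequential: adjacent elements share cache lines
        time_pass(STRIDE);  // strided: most of each fetched line goes unused in a given pass
    }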

Cache designs vary widely based on vendor, product line, processor price, and a whole lot more. Perfect cache optimization is going to be highly elusive unless (1) you know a whole lot about the machine you're going to run on, and (2) you're really not interested in running on any other machine.

One other factor to consider is that 32-bit addresses are half the size of 64-bit addresses, and this has a significant effect on how much data can be cached. Giving more bits to addresses means fewer bits for data, more-or-less.

Prefetching is more witchcraft than science. Fetching data from memory into cache is expensive, even when it happens asynchronously with processor execution (although it can never be separated too far from execution). Locality of reference is a good rule, although it's going to be based on hardware architecture in ways that don't necessarily match code execution at the micro scale. LRU (least recently used) is a common method of deciding what to evict from cache, but evicting something to make room for something that ends up never being used is not a good optimization. So prefetching will be judicious, to say the least.
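
Hardware prefetchers cannot be programmed directly, but GCC and Clang expose a software prefetch hint via the __builtin_prefetch builtin; a sketch, assuming that builtin is available and with a guessed prefetch distance:

    #include <cstddef>
    #include <cstdint>

    // Sum a strided walk, hinting the element a few iterations ahead.
    // PREFETCH_DISTANCE is a guess; a good value depends on memory latency and stride.
    std::uint64_t strided_sum(const std::uint64_t* data, std::size_t n, std::size_t stride) {
        constexpr std::size_t PREFETCH_DISTANCE = 8;
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < n; i += stride) {
            std::size_t ahead = i + PREFETCH_DISTANCE * stride;
            if (ahead < n)
                __builtin_prefetch(&data[ahead], /*rw=*/0, /*locality=*/1);
            sum += data[i];
        }
        return sum;
    }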

EDIT: virtual memory issues, task switching, etc.

Virtual memory certainly makes things far more interesting, especially in operating systems that support multiple address spaces. Cache is most likely to be based on real addresses, not virtual addresses, so things like page swaps can have interesting side-effects on caching. Typically, a page that is due to be swapped out or released will first be invalidated and moved to a "flush list" (from which it can be written to the swap file) or a "free list". Depending on the implementation, these pages can still be reclaimed by an application, but they're no longer addressable - meaning a page fault would occur in the process of reclaiming them. So once a page has been moved out of an app's working set, it's very likely that any cache lines associated with it would be invalidated. If the page isn't being heavily used, it's not likely to have much in the cache either, but in a heavy-swapping situation, cache performance can take a hit along with swapping performance.

Also, some cache designs have a "shared" cache, and most or all have processor- and core-specific caches. Where a cache is designated to a specific processor or core, and that core changes task, the entire cache is likely to be flushed to avoid corruption by the new process. This would not include thread switching, since threads run in the same process and the same address space. The real issue here is that high activity in other applications on the system can impact your cache performance. A shared cache alleviates this problem to some extent, but it has to be managed more carefully to avoid corruption.
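
If the worry is threads migrating between cores (and therefore between per-core caches), one mitigation is to pin each thread to a core; a minimal sketch using pthread_setaffinity_np, assuming a Linux/glibc environment:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE  // pthread_setaffinity_np is a GNU extension
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Pin the calling thread to a single CPU so it keeps using the same core's caches.
    bool pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

    int main() {
        if (!pin_to_cpu(0))
            std::fprintf(stderr, "failed to set affinity\n");
        // ... do the strided reads from this pinned thread ...
    }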
