缓存线如何工作?

发布于 2024-09-27 12:48:08 字数 240 浏览 13 评论 0原文

据我所知,处理器通过缓存线将数据带入缓存,例如,在我的 Atom 处理器上,无论读取的实际数据大小如何,一次都会引入大约 64 个字节。

我的问题是:

想象一下,你需要从内存中读取一个字节,这64个字节将被带入缓存中?

我可以看到的两种可能性是,64 字节从感兴趣字节下方最近的 64 字节边界开始,或者 64 字节以某种预定方式分布在该字节周围(例如,一半在下面、一半在上面,或者以上)。

是哪一个?

I understand that the processor brings data into the cache via cache lines, which - for instance, on my Atom processor - brings in about 64 bytes at a time, whatever the size of the actual data being read.

My question is:

Imagine that you need to read one byte from memory, which 64 bytes will be brought into the cache?

The two possibilities I can see is that, either the 64 bytes start at the closest 64 bytes boundary below the byte of interest, or the 64 bytes are spread around the byte in some predetermined way (for instance, half under, half above, or all above).

Which is it?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

骄傲 2024-10-04 12:48:08

如果包含要加载的字节或字的高速缓存行尚未存在于高速缓存中,则 CPU 将请求从高速缓存行边界开始的 64 个字节(您需要的地址下方的最大地址是 64 的倍数) 。

现代 PC 内存模块一次传输 64 位(8 字节),突发八次传输,因此一个命令会触发从内存中读取或写入完整的缓存行。 (DDR1/2/3/4 SDRAM 突发传输大小可配置为最高 64B;CPU 将选择突发传输大小以匹配其缓存行大小,但 64B 很常见)

根据经验,如果处理器无法预测一次内存访问(并预取),检索过程可能需要约 90 纳秒或约 250 个时钟周期(从 CPU 知道地址到 CPU 接收数据)。

相比之下,在现代 x86 CPU 上,L1 缓存中的命中具有 3 到 5 个周期的加载使用延迟,而存储重新加载具有 4 或 5 个周期的存储转发延迟。其他架构上的情况也类似。


进一步阅读:Ulrich Drepper 的每个程序员都应该了解内存知识 。 DRAM 和缓存细节仍然相关。另请参阅如何“每个程序员应该了解的内存知识”的大部分内容仍然有效吗? - 软件预取建议有点过时:现代硬件预取器更智能,超线程比 P4 时代要好得多(所以预取线程通常是一种浪费)。另外, 标签 wiki 有很多性能链接那个架构。

If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache line boundary (the largest address below the one you need that is multiple of 64).

Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory. (DDR1/2/3/4 SDRAM burst transfer size is configurable up to 64B; CPUs will select the burst transfer size to match their cache line size, but 64B is common)

As a rule of thumb, if the processor can't forecast a memory access (and prefetch it), the retrieval process can take ~90 nanoseconds, or ~250 clock cycles (from the CPU knowing the address to the CPU receiving data).

By contrast, a hit in L1 cache has a load-use latency of 3 to 5 cycles, and a store-reload has a store-forwarding latency of 4 or 5 cycles on modern x86 CPUs. Things are similar on other architectures.


Further reading: Ulrich Drepper's What Every Programmer Should Know About Memory. The DRAM and cache details are still relevant. See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - The software-prefetch advice is a bit outdated: modern HW prefetchers are smarter, and hyperthreading is way better than in P4 days (so a prefetch thread is typically a waste). Also, the tag wiki has lots of performance links for that architecture.

牛↙奶布丁 2024-10-04 12:48:08

首先,主存储器访问非常昂贵。目前,2GHz CPU(最慢的一次)每秒有 2G 的滴答(周期)。 CPU(现在的虚拟核心)可以在每个时钟周期从其寄存器中获取一个值。由于虚拟核心由多个处理单元(ALU - 算术逻辑单元、FPU 等)组成,如果可能的话,它实际上可以并行处理某些指令。

一次访问主存的时间大约为 70ns 到 100ns(DDR4 稍快一些)。这次基本上是查找 L1、L2 和 L3 缓存,然后访问内存(将命令发送到内存控制器,内存控制器将其发送到内存组),等待响应并完成。

100ns 意味着大约 200 个刻度。所以基本上,如果一个程序总是错过每个内存访问的缓存,CPU 将花费大约 99.5% 的时间(如果它只读取内存)空闲等待内存。

为了加快速度,有 L1、L2、L3 缓存。他们使用直接放置在芯片上的存储器,并使用不同类型的晶体管电路来存储给定的位。这比主内存需要更多的空间、更多的能源并且成本更高,因为 CPU 通常是使用更先进的技术生产的,L1、L2、L3 内存的生产故障有可能使 CPU 变得毫无价值(缺陷),因此大型 L1、L2、L3 缓存会增加错误率,从而降低良率,从而直接降低投资回报率。因此,在可用缓存大小方面存在巨大的权衡。

(目前,人们创建了更多的 L1、L2、L3 高速缓存,以便能够停用某些部分,从而减少实际生产缺陷是高速缓存内存区域导致 CPU 整体缺陷的可能性)。

给出一个计时想法(来源:访问缓存和内存的成本)

  • L1 缓存:1ns 至 2ns(2-4 个周期)
  • L2 缓存:3ns 至 5ns(6-10 个周期)
  • 缓存:12ns 至 20ns(24-40 个周期)
  • RAM:60ns(120 个周期)

L3 混合不同的 CPU 类型,这些只是估计值,但可以很好地了解获取内存值时实际发生的情况,并且我们可能在某些缓存层中命中或未命中。

因此,缓存基本上可以大大加快内存访问速度(60 纳秒 vs 1 纳秒)。

获取一个值,将其存储在缓存中以便重新读取它,这对于经常访问的变量来说是有好处的,但对于内存复制操作来说,它仍然会很慢,因为人们只是读取一个值,将该值写入某处,而从不读取该值再次...没有缓存命中,速度非常慢(此外,由于我们执行无序,因此这可能会并行发生)。

这种内存复制非常重要,因此有不同的方法可以加快它的速度。早期,内存通常能够复制 CPU 外部的内存。它由内存控制器直接处理,因此内存复制操作不会污染缓存。

但除了普通的内存复制之外,其他串行内存访问也很常见。一个例子是分析一系列信息。拥有一个整数数组并计算总和、均值、平均值或更简单地查找某个值(过滤器/搜索)是每次在任何通用 CPU 上运行的另一类非常重要的算法。

因此,通过分析内存访问模式,很明显,数据经常被顺序读取。如果程序读取,很有可能
索引 i 处的值,程序还将读取值 i+1。这个概率略高于同一程序也读取值 i+2 等的概率。

因此,给定一个内存地址,提前读取并获取附加值曾经是(现在仍然是)一个好主意。这就是为什么有升压模式的原因。

升压模式下的内存访问意味着发送一个地址并顺序发送多个值。每个额外的值发送只需要大约额外的 10ns(甚至更少)。

另一个问题是地址。发送地址需要时间。为了寻址大部分存储器,必须发送大地址。在早期,这意味着地址总线不够大,无法在单个周期(滴答声)内发送地址,并且需要多个周期来发送地址,从而增加了更多延迟。

例如,64 字节的高速缓存行意味着内存被划分为大小为 64 字节的不同(不重叠)内存块。 64 字节意味着每个块的起始地址的最低 6 个地址位始终为零。因此,对于任意数量的地址总线宽度,每次发送这六个零位不需要将地址空间增加 64 倍(欢迎效果)。

高速缓存行解决的另一个问题(除了预读和保存/释放地址总线上的六位之外)是高速缓存的组织方式。例如,如果高速缓存被划分为 8 字节(64 位)块(单元),则需要存储内存单元的地址,该高速缓存单元将保存其值。如果地址也是 64 位,则意味着该地址消耗了一半的缓存大小,从而导致 100% 的开销。

由于缓存行是 64 字节,CPU 可能使用 64 位 - 6 位 = 58 位(无需将零位存储得太正确),这意味着我们可以缓存 64 字节或 512 位,但开销为 58 位(11% 开销)。实际上,存储的地址甚至比这个更小,但有状态信息(例如缓存行是否有效且准确,是否脏,是否需要写回内存等)。

另一方面是我们有组关联缓存。并非每个缓存单元都能够存储某个地址,而只能存储其中的一个子集。这使得必要的存储地址位甚至更小,允许并行访问高速缓存(每个子集可以访问一次但独立于其他子集)。

尤其是当涉及到同步不同虚拟核心之间的高速缓存/内存访问、每个核心独立的多个处理单元以及最后一个主板上的多个处理器(有些主板可容纳多达 48 个处理器甚至更多)时。

这基本上就是我们拥有缓存线的当前想法。提前读取的好处非常高,并且从缓存行中读取单个字节并且不再读取其余字节的最坏情况非常小,因为概率非常小。

缓存行的大小 (64) 是较大缓存行之间明智选择的权衡,这使得在不久的将来不太可能读取它的最后一个字节,即获取完整缓存行所需的持续时间从内存中读取(并将其写回)以及缓存组织以及缓存和内存访问的并行化的开销。

First of all a main memory access is very expensive. Currently a 2GHz CPU (the slowest once) has 2G ticks (cycles) per second. A CPU (virtual core nowadays) can fetch a value from its registers once per tick. Since a virtual core consists of multiple processing units (ALU - arithmetic logic unit, FPU etc) it can actually process certain instructions in parallel if possible.

An access of the main memory costs about 70ns to 100ns (DDR4 is slightly faster). This time is basically looking up the L1, L2 and L3 cache and than hit the memory (send command to memory controller, which sends it to the memory banks), wait for the response and done.

100ns means about 200 ticks. So basically if a program would always miss the caches which each memory access, the CPU would spend about 99,5% of its time (if it only reads memory) idle waiting for the memory.

In order to speed things up there is the L1, L2, L3 caches. They use memory being directly placed on the chip and using a different kind of transistor circuits to store the given bits. This takes more room, more energy and is more costly than the main memory since a CPU is usually produced using a more advanced technology and a production failure in the L1, L2, L3 memory has the chance to render the CPU worthless (defect) so large L1, L2, L3 caches increase the error rate which decreases the yield which directly decreases ROI. So there is a huge trade off when it comes to available cache size.

(currently one creates more L1, L2, L3 caches in order to be able to deactivate certain portions to decrease the chance that an actual production defect is the cache memory areas renders the CPU defect as a whole).

To give a timing idea (source: costs to access caches and memory)

  • L1 cache: 1ns to 2ns (2-4 cycles)
  • L2 cache: 3ns to 5ns (6-10 cycles)
  • L3 cache: 12ns to 20ns (24-40 cycles)
  • RAM: 60ns (120 cycles)

Since we mix different CPU types these are just estimates but give a good idea whats really going when a memory value is fetched and we might have a hit or a miss in certain cache layer.

So a cache basically speeds up memory access greatly (60ns vs. 1ns).

Fetching a value, storing it in the cache for the chance of rereading it is good for variables that are often accessed but for memory copy operations it would be still to slow since one just reads a value, writes the value somewhere and never reads the value again... no cache hits, dead slow (beside this can happen in parallel since we have out of order execution).

This memory copy is so important that there are different means to speed it up. In early days memory was often able to copy memory outside of the CPU. It was handled by the memory controller directly, so a memory copy operation did not pollute the caches.

But beside from a plain memory copy other serial access of memory was quite common. An example is analyzing a series of information. Having an array of integers and calculating sum, mean, average or even more simple find a certain value (filter/search) were another very important class of algorithms run every time on any general purpose CPU.

So by analyzing memory access pattern it was apparent that data is read sequentially very often. There was a high probability that if a program reads
the value at index i, that the program will also read the value i+1. This probability is slightly higher than the probability that the same program will also read the value i+2 and so on.

So given a memory address it was (and still is) a good idea to read ahead and fetch additional values. This is the reason why there is a boost mode.

Memory access in boost mode means, that an address is send and multiple values are sequentially send. Each additional value send only takes about additional 10ns (or even below).

Another problem was an address. Sending an address takes time. In order to address a large portion of memory large addresses has to be send. In early days it meant that the address bus was not big enough to send the address in a single cycle (tick) and more than one cycle was needed to send the address adding more delay.

A cache line of 64 bytes for instance means that the memory is divided in distinct (non-overlapping) blocks of memory being 64bytes in size. 64bytes mean the start address of each block has the lowest six address bits to be always zeros. So sending these six zero bits each time is not needed increasing the address space 64 times for any number of address bus width (welcome effect).

Another problem the cache line solves (beside reading ahead and saving / freeing six bits on the address bus) is in the way the cache is organized. For example if a cache would be divided in 8 byte (64bit) blocks (cells) one needs to store the address of the memory cell this cache cell holds the value for along with it. If the address would also be 64bit this means that half the cache size is consumed by the address resulting in an overhead of 100%.

Since a cache line is 64bytes and a CPU might use 64bit - 6bit = 58bit (no need to store the zero bits too right) means we can cache 64bytes or 512bits with an overhead of 58bit (11% overhead). In reality the addresses stored are even smaller than this but there are status informations (like is the cache line valid and accurate, dirty and needs to wrote back in ram etc).

Another aspect is that we have set-associative cache. Not every cache cell is able to store a certain address but only a subset of those. This makes the necessary stored address bits even smaller, allows parallel access of the cache (each subset can be accessed once but independent of the other subsets).

There is more especially when it comes to synchronize cache/memory access between the different virtual cores, their independent multiple processing units per core and finally multiple processors on one mainboard (which there are boards housing as much as 48 processors and more).

This is basically the current idea why we have cache lines. The benefit from reading ahead is very high and the worst case of reading a single byte out of a cache-line and never read the rest again is very slim since the probability is very slim.

The size of the cache-line (64) is a wise chosen trade-off between larger cache-lines makes it unlikely for the last byte of it to be read also in the near future, the duration it takes to fetch the complete cache line from memory (and to write it back) and also the overhead in cache organization and the parallelization of cache and memory access.

摘星┃星的人 2024-10-04 12:48:08

如果高速缓存行是 64 字节宽,则它们对应于从可被 64 整除的地址开始的内存块。任何地址的最低有效 6 位是高速缓存行的偏移量。

因此,对于任何给定的字节,可以通过清除地址的最低有效六位来找到必须获取的缓存行,这对应于向下舍入到可被 64 整除的最近地址。

虽然这是由硬件完成的,我们可以使用一些参考 C 宏定义来显示计算:

#define CACHE_BLOCK_BITS 6
#define CACHE_BLOCK_SIZE (1U << CACHE_BLOCK_BITS)  /* 64 */
#define CACHE_BLOCK_MASK (CACHE_BLOCK_SIZE - 1)    /* 63, 0x3F */

/* Which byte offset in its cache block does this address reference? */
#define CACHE_BLOCK_OFFSET(ADDR) ((ADDR) & CACHE_BLOCK_MASK)

/* Address of 64 byte block brought into the cache when ADDR accessed */
#define CACHE_BLOCK_ALIGNED_ADDR(ADDR) ((ADDR) & ~CACHE_BLOCK_MASK)

If cache lines are 64 bytes wide, then they correspond to blocks of memory which start on addresses that are divisible by 64. The least significant 6 bits of any address are an offset into the cache line.

So for any given byte, the cache line which has to be fetched can be found by clearing the least signficant six bits of the address, which corresponds to rounding down to the nearest address that is divisible by 64.

Though this is done by hardware, we can show the calculations using some reference C macro definitions:

#define CACHE_BLOCK_BITS 6
#define CACHE_BLOCK_SIZE (1U << CACHE_BLOCK_BITS)  /* 64 */
#define CACHE_BLOCK_MASK (CACHE_BLOCK_SIZE - 1)    /* 63, 0x3F */

/* Which byte offset in its cache block does this address reference? */
#define CACHE_BLOCK_OFFSET(ADDR) ((ADDR) & CACHE_BLOCK_MASK)

/* Address of 64 byte block brought into the cache when ADDR accessed */
#define CACHE_BLOCK_ALIGNED_ADDR(ADDR) ((ADDR) & ~CACHE_BLOCK_MASK)
平定天下 2024-10-04 12:48:08

处理器可能具有多级缓存(L1、L2、L3),这些缓存的大小和速度有所不同。

然而,要了解每个缓存中的具体内容,您必须研究该特定处理器使用的分支预测器,以及程序的指令/数据如何对其进行操作。

了解分支预测器CPU 缓存替换策略

这不是一件容易的事。如果最终您想要的只是性能测试,则可以使用类似 Cachegrind。然而,由于这是模拟,其结果可能会有所不同。

Processors may have multi-level caches (L1, L2, L3), and these differ on size and speed.

Yet, to understand what exactly goes into each cache you'll have to study the branch predictor used by that specific processor, and how the instructions/data of your program behave against it.

Read about branch predictor, CPU cache and replacement policies.

This is not an easy task. If at the end of the day all you want is a performance test, you can use a tool like Cachegrind. However, as this is a simulation, its result may differ at some degree.

入画浅相思 2024-10-04 12:48:08

我不能肯定地说,因为每个硬件都不同,但它通常是“64 字节从下面最近的 64 字节边界开始”,因为这对于 CPU 来说是一个非常快速且简单的操作。

I can't say for certain as every hardware is different, but it is typically "64 bytes start at the closest 64 bytes boundary below" as that is a very fast and simple operation for the CPU.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文