When L1 misses are very different from L2 accesses... TLB related?
I have been running some benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing to me.
Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One explanation I can find is TLB-related: when a virtual address is not mapped in the TLB, the system automatically skips searches in some cache levels.
Does this seem legitimate?
2 Answers
First, inclusive cache hierarchies may not be as common as you assume. For example, I do not think any current Intel processors - not Nehalem, not Sandybridge, and probably not the Atoms - have an L1 that is included within the L2. (Nehalem and probably Sandybridge do, however, have both L1 and L2 included within the L3; using Intel's current terminology, FLC and MLC in LLC.)
But this doesn't necessarily matter. In most cache hierarchies, if you have an L1 cache miss, then that miss will probably be looked up in the L2. It doesn't matter whether the hierarchy is inclusive or not. To do otherwise, you would have to have something that told you that the data you care about is (probably) not in the L2, so you don't need to look. I have designed protocols and memory types that do this - e.g. a memory type that cached only in the L1 but not the L2, useful for stuff like graphics where you get the benefits of combining in the L1, but where you are repeatedly scanning over a large array, so caching in the L2 is not a good idea. But I am not aware of anyone shipping them at the moment.
Anyway, here are some reasons why the number of L1 cache misses may not be equal to the number of L2 cache accesses.
You don't say what systems you are working on - I know my answer is applicable to Intel x86s such as Nehalem and Sandybridge, whose EMON performance event monitoring allows you to count things such as L1 and L2 cache misses, etc. It will probably also apply to any modern microprocessor with hardware performance counters for cache misses, such as those on ARM and Power.
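As one concrete way to read such counters from software (an illustration only, since neither the question nor this answer names a tool; exact events and their availability vary by CPU), Linux exposes a generic L1D read-miss event through the raw perf_event_open interface. A minimal sketch:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size   = sizeof attr;
    attr.type   = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_L1D
                | (PERF_COUNT_HW_CACHE_OP_READ     << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled       = 1;   /* start stopped; enable around the region */
    attr.exclude_kernel = 1;

    /* No glibc wrapper exists for perf_event_open; invoke it directly. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET,  0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: the code being benchmarked ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long misses = 0;
    if (read(fd, &misses, sizeof misses) == sizeof misses)
        printf("L1D read misses: %lld\n", misses);
    close(fd);
    return 0;
}
```

Which L2 event to compare against is model-specific; the generic cache events only go down to the last-level cache, so an "L2 accesses" count typically comes from a raw, model-specific event.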
Most modern microprocessors do not stop at the first cache miss, but keep going, trying to do extra work. This is often called speculative execution. Furthermore, the processor may be in-order or out-of-order; although the latter may give you even greater differences between the number of L1 misses and the number of L2 accesses, it's not necessary - you can get this behavior even on in-order processors.
Short answer: many of these speculative memory accesses will be to the same memory location. They will be squashed and combined.
The performance event "L1 cache misses" is probably[*] counting the number of (speculative) instructions that missed the L1 cache. These then allocate a hardware data structure, called a fill buffer at Intel and a miss status handling register (MSHR) in some other places. Subsequent cache misses to the same cache line will miss the L1 cache but hit the fill buffer, and will get squashed. Only one of them, typically the first, will get sent to the L2 and counted as an L2 access.
By the way, there may be a performance event for this: Squashed_Cache_Misses.
There may also be a performance event L1_Cache_Misses_Retired. But this may undercount, since speculation may pull the data into the cache, and a cache miss at retirement may never occur.
([*] By the way, when I say "probably" here I mean "on the machines that I helped design". Almost definitely. I might have to check the definition and look at the RTL, but it is almost guaranteed.)
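To make the squashing arithmetic concrete, here is a toy model of the fill-buffer / MSHR behavior just described. It is a sketch, not any shipping design; the line size and the number of entries are assumptions:

```c
/* Toy model: every demand miss bumps the L1-miss count, but only the
   first miss to a given line allocates a fill-buffer entry and goes
   to the L2. */
#include <stdio.h>

#define LINE_SHIFT   6     /* assume 64-byte cache lines */
#define MSHR_ENTRIES 10    /* assumed number of outstanding fills */

static unsigned long mshr[MSHR_ENTRIES];  /* lines currently in flight */
static int  mshr_used;
static long l1_misses, l2_accesses;

/* Model one speculative access that missed the L1. */
static void l1_miss(unsigned long addr) {
    unsigned long line = addr >> LINE_SHIFT;
    l1_misses++;
    for (int i = 0; i < mshr_used; i++)
        if (mshr[i] == line)
            return;                  /* already being fetched: squashed */
    if (mshr_used < MSHR_ENTRIES)
        mshr[mshr_used++] = line;
    l2_accesses++;                   /* first miss to the line reaches L2 */
}

int main(void) {
    for (unsigned long a = 0; a < 128; a++)  /* bytes of two cache lines */
        l1_miss(a);
    printf("L1 misses: %ld, L2 accesses: %ld\n", l1_misses, l2_accesses);
    /* -> L1 misses: 128, L2 accesses: 2 */
    return 0;
}
```

Every miss counts against the L1, but only one access per line reaches the L2; that gap is exactly the difference between the two counters.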
E.g. imagine that you are accessing bytes A[0], A[1], A[2], ... A[63], A[64], ...
If the address of A[0] is equal to zero modulo 64, then A[0]..A[63] will be in the same cache line, on a machine with 64-byte cache lines. If the code that uses these is simple, it is quite possible that all of them can be issued speculatively. QED: 64 speculative memory accesses, 64 L1 cache misses, but only one L2 memory access.
(By the way, don't expect the numbers to be quite so clean. You might not get exactly 64 L1 accesses per L2 access.)
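If you want to see this on real hardware, here is a minimal workload in the spirit of the example; run it under your counter tool and compare an L1D-miss event against an L2-request event (event names are model-specific, and as noted above, don't expect a clean 64:1 ratio):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N (64 * 1024 * 1024)   /* far larger than L2, so most lines are cold */

int main(void) {
    unsigned char *a = malloc(N);
    if (!a) return 1;
    memset(a, 1, N);            /* fault the pages in before measuring */

    long sum = 0;
    for (long i = 0; i < N; i++)
        sum += a[i];            /* 64 byte-reads per 64-byte line; misses to
                                   one line can coalesce into roughly one
                                   L2 access per line */
    printf("%ld\n", sum);       /* keep the loop from being optimized away */
    free(a);
    return 0;
}
```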
Some more possibilities:
If the number of L2 accesses is greater than the number of L1 cache misses (I have almost never seen it, but it is possible), you may have a memory access pattern that is confusing a hardware prefetcher. The hardware prefetcher tries to predict which cache lines you are going to need. If the prefetcher predicts badly, it may fetch cache lines that you don't actually need. Oftentimes there is a performance event to count Prefetches_from_L2 or Prefetches_from_Memory.
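As a sketch of a pattern that tends to do this (assuming a typical stride-based prefetcher, not any particular design): pointer-chasing through a random cyclic permutation gives the prefetcher no stride to latch onto, so whatever it fetches speculatively can show up as L2 traffic with no matching L1 demand misses:

```c
#include <stdio.h>
#include <stdlib.h>

#define N ((size_t)1 << 22)     /* 4M entries (~32 MB), well past L2 capacity */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm),
       so the chase below visits every entry exactly once. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;          /* j in [0, i) */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Each load depends on the previous one: no stride, no lookahead. */
    size_t p = 0;
    for (size_t k = 0; k < N; k++) p = next[p];

    printf("%zu\n", p);
    free(next);
    return 0;
}
```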
Some machines may cancel speculative accesses that have caused an L1 cache miss, before they are sent to the L2. However, I don't know of Intel doing this.
The write policy of a data cache determines whether a store hit writes its data only into that cache (write-back or copy-back) or also into the next level of the cache hierarchy (write-through).
Hence, a store that hits in a write-through L1-D cache also writes its data into the L2 cache.
This could be another source of L2 accesses that do not come from L1 cache misses.
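A toy tally makes the bookkeeping explicit; the counts here are hypothetical, purely for illustration:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical per-run counts, just to show the arithmetic. */
    long l1_load_misses  = 1000;
    long l1_store_misses = 200;
    long l1_store_hits   = 5000;

    /* Write-back L1: only misses generate L2 accesses. */
    long l2_wb = l1_load_misses + l1_store_misses;

    /* Write-through L1: store hits also write to the L2. */
    long l2_wt = l1_load_misses + l1_store_misses + l1_store_hits;

    printf("write-back:    %ld L2 accesses\n", l2_wb);   /* 1200 */
    printf("write-through: %ld L2 accesses\n", l2_wt);   /* 6200 */
    return 0;
}
```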