LRU, FIFO, Random
When there is a page fault or a cache miss we can use either the Least Recently Used (LRU), First In First Out (FIFO) or Random replacement algorithms. I was wondering, which one provides the best performance, aka the least possible future cache misses/page faults?
Architecture: Coldfire processor
You specify the architecture as a 68000, which is a CPU rather than a GPU, USB controller or other piece of hardware that might also access a cache, however...
Therefore the code you run on the 68000 will make a huge difference to the "least possible future cache misses/page faults" part of the question.
In this you differentiate between cache misses and page faults. I'm not sure exactly which ColdFire architecture you are referring to, but I'm assuming it doesn't have hardware TLB replacement and uses a software mechanism instead (so the cache will be shared with the applications' data).
In the replacement policy the most important factor is the number of associations (or ways).
A direct-mapped cache (1 way) almost always correlates directly with the low-order bits of the address (the number of bits specifies the size of the cache), so a 32k cache would use the lower 15 bits. In this case the replacement algorithms LRU, FIFO or Random would be useless, as there is only one possible choice.
However, the writeback or writethrough selection of the cache has more effect. For writes to memory only, writethrough means the cache line isn't allocated, as opposed to a writeback cache, where the line currently in the cache sharing the same lower 15 bits is ejected from the cache and the written line is read in and then modified (for use IF the code running on the CPU uses this data).
For operations that write and don't perform multiple operations on the data, writethrough is usually much better. Also, on modern processors (and I don't know if this architecture supports it), writethrough or writeback can be selected on a TLB/page basis. This can have a much greater effect on the cache than the replacement policy: you can programme the system to match the type of data in each page, especially in a direct-mapped cache ;-)
So a direct-mapped cache is pretty easy to understand, and it also makes it easy to understand the basis of a cache's worst case, best case and average case.
Imagine a memcpy routine which copies data aligned to the cache size, for example a 32k direct-mapped cache with two 32k buffers aligned on a 32k boundary...
Below you can see the reads and writes as they copy each word of data; notice the lower 15 bits are the same for each read and write operation.
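A rough sketch of that loop and the addresses it touches (my own illustration, assuming a 4-byte word copy with the source buffer at 0x00000 and the destination at 0x08000, both 32k-aligned):

    /* hypothetical copy loop */
    for (int i = 0; i < 32768 / 4; i++)
        dst[i] = src[i];     /* read 0x00000 + 4*i, then write 0x08000 + 4*i */

    read  0x00000    write 0x08000    (low 15 bits of both: 0x0000 -> same cache line)
    read  0x00004    write 0x08004    (low 15 bits of both: 0x0004 -> same cache line)
    read  0x00008    write 0x08008    ...and so on for every word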
A direct-mapped cache using writeback (remember, writeback allocates lines) does the following:
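A hedged sketch of what a write-allocating writeback cache does with that access stream (one plausible ordering, not a cycle-accurate trace):

    read  0x00000  -> miss: fetch line 0x00000..0x0001F into the cache
    write 0x08000  -> miss: evict the 0x00000 line, fetch line 0x08000..0x0801F (allocate), modify it in the cache
    read  0x00004  -> miss: the 0x00000 line is gone; write the dirty 0x08000 line back, fetch 0x00000 again
    write 0x08004  -> miss: evict 0x00000 again, fetch 0x08000 again, modify it again
    ...               (every access misses, and most of them also force a line writeback)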
As you can see, a lot of memory operations go on; this is in fact called "thrashing" and is the best example of a worst-case scenario.
Now imagine that we use a writethrough cache instead; these are the operations:
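Again a sketch under the same assumptions, but with a writethrough, no-write-allocate cache:

    read  0x00000  -> miss: fetch line 0x00000..0x0001F into the cache
    write 0x08000  -> goes straight to memory, no line allocated, the 0x00000 line stays cached
    read  0x00004  -> hit
    write 0x08004  -> straight to memory
    ...
    read  0x00020  -> miss: fetch the next line (one read miss per 32-byte line, nothing else)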
As you can see, the difference is huge: we no longer thrash, and in fact this example is now the best case.
Ok so that's the simple case of write through vs. write back.
Direct-mapped caches, however, are not very common now; most people use 2-, 4- or 8-way caches, i.e. there are 2, 4 or 8 different possible allocations for a line. So we could store 0x0000, 0x8000, 0x10000 and 0x18000 all in the cache at the same time in a 4-way or 8-way cache (obviously an 8-way could also store 0x20000, 0x28000, 0x30000 and 0x38000 as well).
This would avoid this thrashing issue.
Just to clarify, the line number in a 32k direct-mapped cache comes from the bottom 15 bits of the address.
In a 32k 2-way it's the bottom 14 bits.
In a 32k 4-way it's the bottom 13 bits.
In a 32k 8-way it's the bottom 12 bits.
And in a fully associative cache it's the line size (or the bottom 5 bits with a 32-byte line). You can't have less than a line. 32 bytes is typically the optimum transfer size in a DDR memory system (there are other reasons as well; sometimes 16 or 64 bytes can be better, and 1 byte would be optimal in the purely algorithmic case, but let's use 32 as that's very common).
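The index arithmetic in that list can be sketched in C (my own illustration; the sizes are just the 32k/32-byte example above):

    #include <stdio.h>

    /* Which set an address maps to for a given cache size, line size and way count. */
    static unsigned set_index(unsigned addr, unsigned cache_bytes,
                              unsigned line_bytes, unsigned ways)
    {
        unsigned sets = cache_bytes / (line_bytes * ways);  /* e.g. 32k / 32 / 1 = 1024 sets */
        return (addr / line_bytes) % sets;                  /* index bits sit above the 5 offset bits */
    }

    int main(void)
    {
        /* 0x0000 and 0x8000 land on the same line in a 32k direct-mapped cache... */
        printf("%u %u\n", set_index(0x0000, 32768, 32, 1), set_index(0x8000, 32768, 32, 1));
        /* ...and still share a set in a 4-way cache, but there they can coexist. */
        printf("%u %u\n", set_index(0x0000, 32768, 32, 4), set_index(0x8000, 32768, 32, 4));
        return 0;
    }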
To help understand LRU, FIFO and Random, consider a fully associative cache; in a 32k cache with 32-byte lines that is 1024 lines.
With a random replacement policy, even an adversarial pattern would only hit the worst case on roughly one replacement in 1024 (i.e. about a 99.9% hit rate), whereas with either LRU or FIFO I could always write a programme which would "thrash", i.e. always cause worst-case behaviour (a 0% hit rate).
Clearly, if you had a fully associative cache you would only choose LRU or FIFO if the programme and its exact behaviour were known.
For ANYTHING which isn't 99.9% predictable you would choose RANDOM; it's simply the best at not being worst, and one of the best on average. But how about the best case (where I get the best performance)...
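A minimal simulation sketch of that point (my own code, not from the original post): a 1024-line fully associative cache driven by a cyclic sweep over 1025 distinct lines, the classic pattern that defeats LRU and FIFO while random replacement keeps hitting most of the time.

    #include <stdio.h>
    #include <stdlib.h>

    #define LINES    1024          /* fully associative: 1024 lines of 32 bytes = 32k */
    #define DISTINCT (LINES + 1)   /* one line more than fits -> adversarial sweep    */
    #define ACCESSES 100000

    static int find(const long *cache, long tag)
    {
        for (int i = 0; i < LINES; i++)
            if (cache[i] == tag)
                return i;
        return -1;
    }

    /* policy: 'L' = LRU, 'F' = FIFO, 'R' = random */
    static double hit_rate(char policy)
    {
        long cache[LINES], last_use[LINES];
        long hits = 0;
        int fifo_next = 0;
        for (int i = 0; i < LINES; i++) {
            cache[i] = -1;
            last_use[i] = i - LINES;           /* so a cold LRU cache fills in order */
        }
        for (long t = 0; t < ACCESSES; t++) {
            long tag = t % DISTINCT;           /* cyclic sweep: 0,1,...,1024,0,1,... */
            int slot = find(cache, tag);
            if (slot >= 0) {                   /* hit */
                hits++;
                last_use[slot] = t;
                continue;
            }
            int victim;
            if (policy == 'F') {               /* FIFO: round-robin victim */
                victim = fifo_next;
                fifo_next = (fifo_next + 1) % LINES;
            } else if (policy == 'R') {        /* random victim */
                victim = rand() % LINES;
            } else {                           /* LRU: least recently touched slot */
                victim = 0;
                for (int i = 1; i < LINES; i++)
                    if (last_use[i] < last_use[victim])
                        victim = i;
            }
            cache[victim] = tag;
            last_use[victim] = t;
        }
        return 100.0 * hits / ACCESSES;
    }

    int main(void)
    {
        printf("LRU    %5.1f%% hits\n", hit_rate('L'));   /* ~0%: every access misses once warm */
        printf("FIFO   %5.1f%% hits\n", hit_rate('F'));   /* ~0%: same thrashing                */
        printf("Random %5.1f%% hits\n", hit_rate('R'));   /* typically well above 80%           */
        return 0;
    }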
Well it depends basically on the number of ways...
With 2 ways I can optimise things like memcpy and other algorithms to do a good job. Random would get it wrong half the time.
With 4 ways, when I switch between tasks I might not damage the cache so much, so their data stays local. Random would get it wrong a quarter of the time.
With 8 ways statistics can start to take effect: a 7/8 hit rate on a memcpy isn't as good as 1023/1024 (fully associative or optimised code), but for non-optimised code it makes a difference.
So why don't people make fully associative caches with random replacement policies?!
Well, it's not because they can't generate good random numbers; in fact a pseudo-random number generator is just as good. And yes, I could write a programme to get a 100% miss rate, but that isn't the point: I couldn't write a useful programme that would miss 100% of the time, which I could with an LRU or FIFO algorithm.
A 32k fully associative cache with 32-byte lines requires you to compare 1024 values. In hardware this is done through a CAM, but that is an expensive piece of hardware, and it's also just not possible to compare this many values in a "FAST" processing time. I wonder if a quantum computer could...
Anyway, to answer your question of which one is better: for anything that isn't highly predictable, random is the best at avoiding the worst case; for code you control and can tune, LRU or FIFO can be made to do better; and the number of ways plus the writethrough/writeback choice usually matter more than the replacement policy itself.
No perfect caching policy exists because it would require knowledge of the future (how a program will access memory).
But, some are measurably better than others in the common memory access pattern cases. This is the case with LRU. LRU has historically given very good performance in overall use.
But, for what you are trying to do, another policy may be better. There is always some memory access pattern that will cause a caching policy to perform poorly.
You may find this thread helpful (and more elaborate!)
Why is LRU better than FIFO?
Between the three, I'd recommend LRU. First, it's a good approximation to optimal scheduling when locality is assumed (this turns out to be a good assumption). Random scheduling cannot benefit from locality. Second, it doesn't suffer from Belady's anomaly (like FIFO); that is, bigger caches mean better performance, which isn't necessarily true with FIFO.
Unless your specific problem domain strongly suggests using something else, LRU is going to be hard to beat in the general case.
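To see Belady's anomaly concretely, here is a small self-contained sketch (my own code) that runs FIFO over the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5: with 3 frames it takes 9 page faults, and with 4 frames it takes 10.

    #include <stdio.h>

    /* Count page faults for FIFO replacement with a given number of frames (<= 16). */
    static int fifo_faults(const int *refs, int n, int frames)
    {
        int frame[16], used = 0, next = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (frame[j] == refs[i]) { hit = 1; break; }
            if (hit)
                continue;
            faults++;
            if (used < frames) {
                frame[used++] = refs[i];        /* still filling empty frames */
            } else {
                frame[next] = refs[i];          /* evict the oldest resident page */
                next = (next + 1) % frames;
            }
        }
        return faults;
    }

    int main(void)
    {
        int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
        int n = sizeof refs / sizeof refs[0];
        printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));   /* 9  */
        printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));   /* 10 */
        return 0;
    }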
Of the three, LRU is generally the best while FIFO is the worst and random comes in somewhere between. You can construct access patterns where any of the three is superior to either of the others, but it is somewhat tricky. Interestingly enough, this order is also roughly how expensive they are to implement -- LRU is the most expensive and FIFO is the cheapest. Just goes to show, there's no free lunch.
Many of the architectures I have studied use LRU, as it generally provides not only efficiency in implementation, but also is pretty good on average at preventing misses. However, in the latest x86 architectures, I think there are some more complicated things going on. LRU is sort of a basic model.
It really depends on what kind of operations you are performing on your device. Depending on the types of operations, different eviction policies will work better. For example, FIFO works well when traversing memory sequentially.
If you want the best of both worlds, consider an adaptive approach that alters its strategy based on actual usage patterns. For example, look at IBM's Adaptive Replacement Cache algorithm: http://code.activestate.com/recipes/576532-adaptive-replacement-cache-in-python/