LRU, FIFO, Random
When there is a page fault or a cache miss we can use either the Least Recently Used (LRU), First In First Out (FIFO) or Random replacement algorithms. I was wondering, which one provides the best performance, aka the least possible future cache misses/page faults?
Architecture: Coldfire processor
You specify the architecture as a 68000, which is a CPU rather than a GPU, USB controller or other piece of hardware that might also access a cache, however...
Therefore the code you run on the 68000 will make a huge difference to the "least possible future cache misses/page faults" part of the question.
In this you differentiate between cache misses and page faults. I'm not sure exactly which ColdFire architecture you are referring to, but I'm assuming it doesn't have hardware TLB replacement and uses a software mechanism instead (so the cache will be shared with the applications' data).
In the replacement policy the most important factor is the number of associations (or ways).
A direct-mapped cache (1 way) almost always correlates directly with the low-order bits of the address (the number of bits specifies the size of the cache), so a 32k cache would use the lower 15 bits. In this case the replacement algorithms LRU, FIFO or Random would be useless, as there is only one possible choice.
However, the writeback or writethrough selection of the cache has more effect. For writes to memory only, writethrough means the cache line isn't allocated, as opposed to a writeback cache, where the line currently in the cache sharing the same lower 15 bits is ejected from the cache and the written line is read in and then modified (for use IF the code running on the CPU uses this data).
For operations that write and don't perform multiple operations on the data, writethrough is usually much better. Also, on modern processors (and I don't know if this architecture supports it), writethrough or writeback can be selected on a TLB/page basis. This can have a much greater effect on the cache than the replacement policy: you can programme the system to match the type of data in each page, especially in a direct-mapped cache ;-)
So a direct-mapped cache is pretty easy to understand, and it also makes it easy to understand the basis of a cache's worst case, best case and average case.
Imagine a memcpy routine which copies data aligned to the cache size, for example a 32k direct-mapped cache with two 32k buffers aligned on a 32k boundary...
Below you can see the reads and writes as they copy each word of data; notice the lower 15 bits are the same for each read and write operation.
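A rough sketch of that loop and the addresses it touches (my own illustration, assuming a 4-byte word copy with the source buffer at 0x00000 and the destination at 0x08000, both 32k-aligned):

    /* hypothetical copy loop */
    for (int i = 0; i < 32768 / 4; i++)
        dst[i] = src[i];     /* read 0x00000 + 4*i, then write 0x08000 + 4*i */

    read  0x00000    write 0x08000    (low 15 bits of both: 0x0000 -> same cache line)
    read  0x00004    write 0x08004    (low 15 bits of both: 0x0004 -> same cache line)
    read  0x00008    write 0x08008    ...and so on for every word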
A direct-mapped cache using writeback (remember, writeback allocates lines) does the following:
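A hedged sketch of what a write-allocating writeback cache does with that access stream (one plausible ordering, not a cycle-accurate trace):

    read  0x00000  -> miss: fetch line 0x00000..0x0001F into the cache
    write 0x08000  -> miss: evict the 0x00000 line, fetch line 0x08000..0x0801F (allocate), modify it in the cache
    read  0x00004  -> miss: the 0x00000 line is gone; write the dirty 0x08000 line back, fetch 0x00000 again
    write 0x08004  -> miss: evict 0x00000 again, fetch 0x08000 again, modify it again
    ...               (every access misses, and most of them also force a line writeback)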
As you can see, a lot of memory operations go on; this is in fact called "thrashing" and is the best example of a worst-case scenario.
Now imagine that we use a writethrough cache instead; these are the operations:
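Again a sketch under the same assumptions, but with a writethrough, no-write-allocate cache:

    read  0x00000  -> miss: fetch line 0x00000..0x0001F into the cache
    write 0x08000  -> goes straight to memory, no line allocated, the 0x00000 line stays cached
    read  0x00004  -> hit
    write 0x08004  -> straight to memory
    ...
    read  0x00020  -> miss: fetch the next line (one read miss per 32-byte line, nothing else)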
As you can see, the difference is huge: we no longer thrash, and in fact this example is now the best case.
Ok so that's the simple case of write through vs. write back.
Direct-mapped caches, however, are not very common now; most people use 2-, 4- or 8-way caches, i.e. there are 2, 4 or 8 different possible allocations for a line. So we could store 0x0000, 0x8000, 0x10000 and 0x18000 all in the cache at the same time in a 4-way or 8-way cache (obviously an 8-way could also store 0x20000, 0x28000, 0x30000 and 0x38000 as well).
This would avoid this thrashing issue.
Just to clarify, the line number in a 32k direct-mapped cache comes from the bottom 15 bits of the address.
In a 32k 2-way it's the bottom 14 bits.
In a 32k 4-way it's the bottom 13 bits.
In a 32k 8-way it's the bottom 12 bits.
And in a fully associative cache it's the line size (or the bottom 5 bits with a 32-byte line). You can't have less than a line. 32 bytes is typically the optimum transfer size in a DDR memory system (there are other reasons as well; sometimes 16 or 64 bytes can be better, and 1 byte would be optimal in the purely algorithmic case, but let's use 32 as that's very common).
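The index arithmetic in that list can be sketched in C (my own illustration; the sizes are just the 32k/32-byte example above):

    #include <stdio.h>

    /* Which set an address maps to for a given cache size, line size and way count. */
    static unsigned set_index(unsigned addr, unsigned cache_bytes,
                              unsigned line_bytes, unsigned ways)
    {
        unsigned sets = cache_bytes / (line_bytes * ways);  /* e.g. 32k / 32 / 1 = 1024 sets */
        return (addr / line_bytes) % sets;                  /* index bits sit above the 5 offset bits */
    }

    int main(void)
    {
        /* 0x0000 and 0x8000 land on the same line in a 32k direct-mapped cache... */
        printf("%u %u\n", set_index(0x0000, 32768, 32, 1), set_index(0x8000, 32768, 32, 1));
        /* ...and still share a set in a 4-way cache, but there they can coexist. */
        printf("%u %u\n", set_index(0x0000, 32768, 32, 4), set_index(0x8000, 32768, 32, 4));
        return 0;
    }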
To help understand LRU, FIFO and Random, consider a fully associative cache; in a 32k cache with 32-byte lines that is 1024 lines.
With a random replacement policy, even an adversarial pattern would only hit the worst case on roughly one replacement in 1024 (i.e. about a 99.9% hit rate), whereas with either LRU or FIFO I could always write a programme which would "thrash", i.e. always cause worst-case behaviour (a 0% hit rate).
Clearly, if you had a fully associative cache you would only choose LRU or FIFO if the programme and its exact behaviour were known.
For ANYTHING which isn't 99.9% predictable you would choose RANDOM; it's simply the best at not being worst, and one of the best on average. But how about the best case (where I get the best performance)...
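A minimal simulation sketch of that point (my own code, not from the original post): a 1024-line fully associative cache driven by a cyclic sweep over 1025 distinct lines, the classic pattern that defeats LRU and FIFO while random replacement keeps hitting most of the time.

    #include <stdio.h>
    #include <stdlib.h>

    #define LINES    1024          /* fully associative: 1024 lines of 32 bytes = 32k */
    #define DISTINCT (LINES + 1)   /* one line more than fits -> adversarial sweep    */
    #define ACCESSES 100000

    static int find(const long *cache, long tag)
    {
        for (int i = 0; i < LINES; i++)
            if (cache[i] == tag)
                return i;
        return -1;
    }

    /* policy: 'L' = LRU, 'F' = FIFO, 'R' = random */
    static double hit_rate(char policy)
    {
        long cache[LINES], last_use[LINES];
        long hits = 0;
        int fifo_next = 0;
        for (int i = 0; i < LINES; i++) {
            cache[i] = -1;
            last_use[i] = i - LINES;           /* so a cold LRU cache fills in order */
        }
        for (long t = 0; t < ACCESSES; t++) {
            long tag = t % DISTINCT;           /* cyclic sweep: 0,1,...,1024,0,1,... */
            int slot = find(cache, tag);
            if (slot >= 0) {                   /* hit */
                hits++;
                last_use[slot] = t;
                continue;
            }
            int victim;
            if (policy == 'F') {               /* FIFO: round-robin victim */
                victim = fifo_next;
                fifo_next = (fifo_next + 1) % LINES;
            } else if (policy == 'R') {        /* random victim */
                victim = rand() % LINES;
            } else {                           /* LRU: least recently touched slot */
                victim = 0;
                for (int i = 1; i < LINES; i++)
                    if (last_use[i] < last_use[victim])
                        victim = i;
            }
            cache[victim] = tag;
            last_use[victim] = t;
        }
        return 100.0 * hits / ACCESSES;
    }

    int main(void)
    {
        printf("LRU    %5.1f%% hits\n", hit_rate('L'));   /* ~0%: every access misses once warm */
        printf("FIFO   %5.1f%% hits\n", hit_rate('F'));   /* ~0%: same thrashing                */
        printf("Random %5.1f%% hits\n", hit_rate('R'));   /* typically well above 80%           */
        return 0;
    }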
Well it depends basically on the number of ways...
With 2 ways I can optimise things like memcpy and other algorithms to do a good job. Random would get it wrong half the time.
With 4 ways, when I switch between tasks I might not damage the cache so much, so their data stays local. Random would get it wrong a quarter of the time.
With 8 ways statistics can start to take effect: a 7/8 hit rate on a memcpy isn't as good as 1023/1024 (fully associative or optimised code), but for non-optimised code it makes a difference.
So why don't people make fully associative caches with random replacement policies?!
Well, it's not because they can't generate good random numbers; in fact a pseudo-random number generator is just as good. And yes, I could write a programme to get a 100% miss rate, but that isn't the point: I couldn't write a useful programme that would miss 100% of the time, which I could with an LRU or FIFO algorithm.
A 32k fully associative cache with 32-byte lines requires you to compare 1024 values. In hardware this is done through a CAM, but that is an expensive piece of hardware, and it's also just not possible to compare this many values in a "FAST" processing time. I wonder if a quantum computer could...
Anyway, to answer your question of which one is better: for anything that isn't highly predictable, random is the best at avoiding the worst case; for code you control and can tune, LRU or FIFO can be made to do better; and the number of ways plus the writethrough/writeback choice usually matter more than the replacement policy itself.
No perfect caching policy exists because it would require knowledge of the future (how a program will access memory).
But, some are measurably better than others in the common memory access pattern cases. This is the case with LRU. LRU has historically given very good performance in overall use.
But, for what you are trying to do, another policy may be better. There is always some memory access pattern that will cause a caching policy to perform poorly.
You may find this thread helpful (and more elaborate!)
Why is LRU better than FIFO?
Between the three, I'd recommend LRU. First, it's a good approximation to optimal scheduling when locality is assumed (this turns out to be a good assumption). Random scheduling cannot benefit from locality. Second, it doesn't suffer from Belady's anomaly (like FIFO); that is, bigger caches mean better performance, which isn't necessarily true with FIFO.
Unless your specific problem domain strongly suggests using something else, LRU is going to be hard to beat in the general case.
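To see Belady's anomaly concretely, here is a small self-contained sketch (my own code) that runs FIFO over the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5: with 3 frames it takes 9 page faults, and with 4 frames it takes 10.

    #include <stdio.h>

    /* Count page faults for FIFO replacement with a given number of frames (<= 16). */
    static int fifo_faults(const int *refs, int n, int frames)
    {
        int frame[16], used = 0, next = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (frame[j] == refs[i]) { hit = 1; break; }
            if (hit)
                continue;
            faults++;
            if (used < frames) {
                frame[used++] = refs[i];        /* still filling empty frames */
            } else {
                frame[next] = refs[i];          /* evict the oldest resident page */
                next = (next + 1) % frames;
            }
        }
        return faults;
    }

    int main(void)
    {
        int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
        int n = sizeof refs / sizeof refs[0];
        printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));   /* 9  */
        printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));   /* 10 */
        return 0;
    }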
Of the three, LRU is generally the best while FIFO is the worst and random comes in somewhere between. You can construct access patterns where any of the three is superior to either of the others, but it is somewhat tricky. Interestingly enough, this order is also roughly how expensive they are to implement -- LRU is the most expensive and FIFO is the cheapest. Just goes to show, there's no free lunch.
Many of the architectures I have studied use LRU, as it generally provides not only efficiency in implementation, but also is pretty good on average at preventing misses. However, in the latest x86 architectures, I think there are some more complicated things going on. LRU is sort of a basic model.
It really depends on what kind of operations you are performing on your device. Depending on the types of operations, different eviction policies will work better. For example, FIFO works well when traversing memory sequentially.
If you want the best of both worlds, consider an adaptive approach that alters its strategy based on actual usage patterns. For example, look at IBM's Adaptive Replacement Cache algorithm: http://code.activestate.com/recipes/576532-adaptive-replacement-cache-in-python/