Cache invalidation - is there a general solution?
"There are only two hard problems in Computer Science: cache invalidation and naming things."
Phil Karlton
Is there a general solution or method to invalidating a cache; to know when an entry is stale, so you are guaranteed to always get fresh data?

For example, consider a function getData() that gets data from a file. It caches it based on the last modified time of the file, which it checks every time it's called.

Then you add a second function transformData() which transforms the data, and caches its result for next time the function is called. It has no knowledge of the file - how do you add the dependency that if the file is changed, this cache becomes invalid?

You could call getData() every time transformData() is called and compare it with the value that was used to build the cache, but that could end up being very costly.
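To make the scenario concrete, here is a minimal Python sketch of the setup described above (getData/transformData mirror the question; the file name and the transform itself are only illustrative):

```python
import os

_data_cache = None          # (mtime, data)
_transformed_cache = None   # transformed result, with no link back to the file

DATA_FILE = "data.txt"      # illustrative path

def getData():
    """Return file contents, re-reading only when the mtime changes."""
    global _data_cache
    mtime = os.path.getmtime(DATA_FILE)
    if _data_cache is None or _data_cache[0] != mtime:
        with open(DATA_FILE) as f:
            _data_cache = (mtime, f.read())
    return _data_cache[1]

def transformData():
    """Cache the transformed result - but nothing ever tells it the file changed."""
    global _transformed_cache
    if _transformed_cache is None:             # never invalidated, so it can go stale
        _transformed_cache = getData().upper() # stand-in for a costly transform
    return _transformed_cache
```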
9 Answers
IMHO, Functional Reactive Programming (FRP) is in a sense a general way to solve cache invalidation.

Here is why: stale data in FRP terminology is called a glitch. One of FRP's goals is to guarantee the absence of glitches.

FRP is explained in more detail in the 'Essence of FRP' talk and in this SO answer (https://stackoverflow.com/a/28247944/1198559).

In the talk, the Cells represent cached objects/entities, and a Cell is refreshed if one of its dependencies is refreshed. FRP hides the plumbing code associated with the dependency graph and makes sure that there are no stale Cells.

Another way (different from FRP) that I can think of is wrapping the computed value (of type b) into some kind of writer monad, MonadWriter (Set (uuid)) b, where Set (uuid) (Haskell notation) contains all the identifiers of the mutable values on which the computed value b depends. So a uuid is some kind of unique identifier for a mutable value/variable (say a row in a database) on which the computed b depends.

Combine this idea with combinators that operate on this kind of writer monad, and that might lead to some kind of general cache-invalidation solution, provided you only use these combinators to calculate a new b. Such combinators (say a special version of filter) take writer monads and (uuid, a) pairs as inputs, where a is a mutable datum/variable identified by uuid.

So every time you mutate a piece of the "original" data (uuid, a) (say the normalized data in a database from which b was computed), you can invalidate any cached b that depends on it, because the Set (uuid) carried in the writer monad tells you exactly which mutable values each b was computed from.

In other words, anytime you mutate something with a given uuid, you broadcast this mutation to all the caches, and they invalidate the values b that depend on the mutable value identified by said uuid, because the writer monad in which each b is wrapped can tell whether that b depends on said uuid or not.

Of course, this only pays off if you read much more often than you write.
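Not Haskell, but a rough Python analogue of the dependency-tracking idea may make it clearer; the Tracked wrapper, combinators, and uuid strings below are all made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Tracked:
    """A computed value tagged with the ids of the mutable values it was built from
    (a poor man's MonadWriter (Set uuid) b)."""
    value: object
    deps: frozenset = field(default_factory=frozenset)

def tracked_map(f, t: Tracked) -> Tracked:
    # Combinator: derived values inherit the dependency set.
    return Tracked(f(t.value), t.deps)

def tracked_combine(f, *ts: Tracked) -> Tracked:
    # Combinator: the result depends on the union of all inputs' dependencies.
    return Tracked(f(*(t.value for t in ts)),
                   frozenset().union(*(t.deps for t in ts)))

# A cache of computed values, invalidated by broadcasting a mutated uuid.
cache: dict[str, Tracked] = {}

def invalidate(mutated_uuid: str) -> None:
    for key in [k for k, t in cache.items() if mutated_uuid in t.deps]:
        del cache[key]

# Usage sketch:
row_a = Tracked(10, frozenset({"row:42"}))   # value read from db row 42
row_b = Tracked(3, frozenset({"row:43"}))
cache["total"] = tracked_combine(lambda x, y: x + y, row_a, row_b)
invalidate("row:42")                          # "total" is dropped from the cache
```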
A third, practical approach is to use materialized views in databases and use them as caches. AFAIK they also aim to solve the invalidation problem. This of course limits the operations that connect the mutable data to the derived data.
I'm working on an approach right now based on PostSharp and memoizing functions. I've run it past my mentor, and he agrees that it's a good implementation of caching in a content-agnostic way.
Every function can be marked with an attribute that specifies its expiry period. Each function marked in this way is memoized and the result is stored into the cache, with a hash of the function call and parameters used as the key. I'm using Velocity for the backend, which handles distribution of the cache data.
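This is not the PostSharp/Velocity implementation, only a small Python sketch of the same idea: a decorator plays the role of the attribute, and the cache key is a hash of the function name and arguments. The in-process dict stands in for the distributed cache backend.

```python
import functools
import hashlib
import pickle
import time

_cache: dict[str, tuple[float, object]] = {}  # key -> (expires_at, value)

def cached(expiry_seconds: float):
    """Memoize a function for a fixed expiry period (illustrative, in-process only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
            ).hexdigest()
            hit = _cache.get(key)
            if hit is not None and hit[0] > time.time():
                return hit[1]                        # still within its expiry period
            value = func(*args, **kwargs)
            _cache[key] = (time.time() + expiry_seconds, value)
            return value
        return wrapper
    return decorator

@cached(expiry_seconds=60)
def expensive_lookup(user_id: int) -> str:
    return f"profile-for-{user_id}"   # stand-in for a slow call
```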
If you're going to getData() every time you do the transform, then you've eliminated the entire benefit of the cache.
For your example, it seems like a solution would be, when you generate the transformed data, to also store the filename and last modified time of the file the data was generated from (you already stored this in whatever data structure was returned by getData(), so you just copy that record into the data structure returned by transformData()), and then when you call transformData() again, check the last modified time of the file.
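A sketch of that suggestion in Python (getData/transformData as in the question; the record layout and the transform are illustrative): the transformed cache carries the source file's name and mtime, so revalidation is a cheap stat rather than a full re-read.

```python
import os

DATA_FILE = "data.txt"        # illustrative path
_transformed = None           # {"filename": ..., "mtime": ..., "result": ...}

def getData() -> str:
    with open(DATA_FILE) as f:
        return f.read()

def transformData() -> str:
    """Reuse the cached transform while the source file's mtime is unchanged."""
    global _transformed
    if _transformed is not None and os.path.getmtime(_transformed["filename"]) == _transformed["mtime"]:
        return _transformed["result"]     # cheap stat, no re-read, no re-transform
    mtime = os.path.getmtime(DATA_FILE)
    data = getData()
    _transformed = {"filename": DATA_FILE, "mtime": mtime, "result": data.upper()}
    return _transformed["result"]
```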
No, because all data is different. Some data may be "stale" after a minute, some after an hour, and some may be fine for days or months.

Regarding your specific example, the simplest solution is to have a 'cache checking' function for files, which you call from both getData and transformData.
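A sketch of such a shared 'cache checking' helper (the names are purely illustrative): each caller remembers the stamp its cache was built from and asks this one function whether that stamp is still current.

```python
import os

def cache_stamp(path: str) -> float:
    """One shared notion of 'version' for file-backed caches: the file's mtime."""
    return os.path.getmtime(path)

def is_fresh(path: str, stamp: float | None) -> bool:
    """Would a cache built when the file had `stamp` still be valid?"""
    return stamp is not None and stamp == cache_stamp(path)

# Both getData and transformData keep the stamp next to their cached value and call:
#   if not is_fresh(DATA_FILE, cached_stamp): rebuild()
```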
There is no general solution, but:

Your cache can act as a proxy (pull). Assume your cache knows the timestamp of the origin's last change. When someone calls getData(), the cache asks the origin for its last-change timestamp; if it is the same, it returns the cached content, otherwise it updates its content from the source and returns that. (A variation is for the client to send the timestamp directly with the request, and the source only returns content if its timestamp is different.)

You can instead use a notification process (push): the cache observes the source, and if the source changes, it sends a notification to the cache, which is then flagged as "dirty". If someone calls getData(), the cache first updates itself from the source and removes the "dirty" flag, then returns its content.

The choice generally speaking depends on usage: many calls to getData() would favour a push, so as to avoid the source being flooded by getTimestamp calls.

Note: as using the timestamp is the traditional way HTTP proxies work, another approach is sharing a hash of the stored content. The only ways I know for two entities to stay updated together are either I call you (pull) or you call me… (push), that's all.
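A compressed sketch of the two variants (pull with a timestamp check vs. push with a dirty flag); the Origin interface here is an assumption for illustration, not something from the answer:

```python
class Origin:
    """Assumed source of truth: exposes its data and a last-modified timestamp."""
    def __init__(self):
        self.timestamp = 0
        self.data = "initial"

    def write(self, data):
        self.data = data
        self.timestamp += 1

class PullCache:
    """Pull: revalidate against the origin's timestamp on every read."""
    def __init__(self, origin: Origin):
        self.origin = origin
        self.timestamp = None
        self.data = None

    def getData(self):
        if self.timestamp != self.origin.timestamp:   # cheap check, not a full fetch
            self.data = self.origin.data
            self.timestamp = self.origin.timestamp
        return self.data

class PushCache:
    """Push: the source notifies the cache, which only marks itself dirty."""
    def __init__(self, origin: Origin):
        self.origin = origin
        self.dirty = True
        self.data = None

    def notify_changed(self):          # to be called whenever the origin changes
        self.dirty = True

    def getData(self):
        if self.dirty:
            self.data = self.origin.data
            self.dirty = False
        return self.data
```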
Caching is hard because you need to consider:

1) the cache may consist of multiple nodes, which need to reach consensus
2) invalidation time
3) race conditions when multiple get/set operations happen concurrently

This is good reading: https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
Perhaps cache-oblivious algorithms would be the most general (or at least, the least dependent on the hardware configuration), since they use the fastest cache first and move on from there. Here's an MIT lecture on it: Cache Oblivious Algorithms.
What you are talking about is lifetime dependency chaining: one thing is dependent on another which can be modified outside of its control.

If you have an idempotent function from a, b to c where, if a and b are the same then c is the same, but the cost of checking b is high, then you either:

1. accept that you sometimes operate with out-of-date data and do not check b every time, or
2. do your best to make checking b as fast as possible.

You cannot have your cake and eat it...

If you can layer an additional cache based on a over the top, then this affects the initial problem not one bit. If you chose 1 then you have whatever freedom you gave yourself and can thus cache more, but must remember to consider the validity of the cached value of b. If you chose 2 you must still check b every time, but can fall back on the cache for a if b checks out.

If you layer caches, you must consider whether you have violated the 'rules' of the system as a result of the combined behaviour.

If you know that a always has validity if b does, then you can arrange your cache like so (pseudocode):
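The pseudocode block from the original answer did not survive in this copy; below is a rough Python sketch of the arrangement it describes (cache keyed on a, revalidated against the costly input b), not the author's original code:

```python
class LayeredCache:
    """Cache c = compute(a, b), keyed on a, revalidated against the costly input b."""
    def __init__(self, compute, version_of_b):
        self.compute = compute            # idempotent: same (a, b) -> same c
        self.version_of_b = version_of_b  # the expensive check on b, made explicit
        self.by_a = {}                    # a -> (b_version_used, c)

    def get(self, a, b):
        b_version = self.version_of_b(b)  # the single validity check on b
        hit = self.by_a.get(a)
        if hit is not None and hit[0] == b_version:
            return hit[1]                 # same a and b checks out: reuse cached c
        c = self.compute(a, b)
        self.by_a[a] = (b_version, c)
        return c
```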
Obviously successive layering (say with x) is trivial so long as, at each stage, the validity of the newly added input matches the a:b relationship for x:b and x:a.

However it is quite possible that you could get three inputs whose validity was entirely independent (or was cyclic), so no layering would be possible. This would mean the line marked // important would have to change accordingly.
The problem in cache invalidation is that stuff changes without us knowing about it. So, in some cases, a solution is possible if there is some other thing that does know about it and can notify us. In the given example, the getData function could hook into the file system, which does know about all changes to files, regardless of what process changes the file, and this component in turn could notify the component that transforms the data.
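For illustration, a sketch of this push-based idea using the third-party watchdog package (the package choice, paths, and cache flags are my assumptions, not the answer's): the file-system watcher invalidates both the data cache and the dependent transformed cache whenever the file changes, so nothing has to poll.

```python
# pip install watchdog
import threading
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

DATA_FILE = "data.txt"                 # illustrative path
_lock = threading.Lock()
_caches_valid = {"data": False, "transformed": False}

class InvalidateOnChange(FileSystemEventHandler):
    """The file system tells us about changes, so the caches never poll."""
    def on_modified(self, event):
        if event.src_path.endswith(DATA_FILE):
            with _lock:
                _caches_valid["data"] = False
                _caches_valid["transformed"] = False   # dependent cache invalidated too

observer = Observer()
observer.schedule(InvalidateOnChange(), path=".", recursive=False)
observer.start()
```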
I don't think there is any general magic fix to make the problem go away. But in many practical cases there may very well be opportunities to transform a "polling"-based approach into an "interrupt"-based one, which can make the problem simply go away.