Cache invalidation - is there a general solution?
"There are only two hard problems in Computer Science: cache invalidation and naming things."
Phil Karlton
Is there a general solution or method to invalidating a cache; to know when an entry is stale, so you are guaranteed to always get fresh data?

For example, consider a function getData() that gets data from a file. It caches it based on the last modified time of the file, which it checks every time it's called.

Then you add a second function transformData() which transforms the data, and caches its result for next time the function is called. It has no knowledge of the file - how do you add the dependency that if the file is changed, this cache becomes invalid?

You could call getData() every time transformData() is called and compare it with the value that was used to build the cache, but that could end up being very costly.
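To make the scenario concrete, here is a minimal Python sketch of the setup described above (getData/transformData mirror the question; the file name and the transform itself are only illustrative):

```python
import os

_data_cache = None          # (mtime, data)
_transformed_cache = None   # transformed result, with no link back to the file

DATA_FILE = "data.txt"      # illustrative path

def getData():
    """Return file contents, re-reading only when the mtime changes."""
    global _data_cache
    mtime = os.path.getmtime(DATA_FILE)
    if _data_cache is None or _data_cache[0] != mtime:
        with open(DATA_FILE) as f:
            _data_cache = (mtime, f.read())
    return _data_cache[1]

def transformData():
    """Cache the transformed result - but nothing ever tells it the file changed."""
    global _transformed_cache
    if _transformed_cache is None:             # never invalidated, so it can go stale
        _transformed_cache = getData().upper() # stand-in for a costly transform
    return _transformed_cache
```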
9 Answers
IMHO, Functional Reactive Programming (FRP) is in a sense a general way to solve cache invalidation.

Here is why: stale data in FRP terminology is called a glitch. One of FRP's goals is to guarantee the absence of glitches.

FRP is explained in more detail in the 'Essence of FRP' talk and in this SO answer (https://stackoverflow.com/a/28247944/1198559).

In the talk, the Cells represent cached objects/entities, and a Cell is refreshed if one of its dependencies is refreshed. FRP hides the plumbing code associated with the dependency graph and makes sure that there are no stale Cells.

Another way (different from FRP) that I can think of is wrapping the computed value (of type b) into some kind of writer monad, MonadWriter (Set (uuid)) b, where Set (uuid) (Haskell notation) contains all the identifiers of the mutable values on which the computed value b depends. So a uuid is some kind of unique identifier for a mutable value/variable (say a row in a database) on which the computed b depends.

Combine this idea with combinators that operate on this kind of writer monad, and that might lead to some kind of general cache-invalidation solution, provided you only use these combinators to calculate a new b. Such combinators (say a special version of filter) take writer monads and (uuid, a) pairs as inputs, where a is a mutable datum/variable identified by uuid.

So every time you mutate a piece of the "original" data (uuid, a) (say the normalized data in a database from which b was computed), you can invalidate any cached b that depends on it, because the Set (uuid) carried in the writer monad tells you exactly which mutable values each b was computed from.

In other words, anytime you mutate something with a given uuid, you broadcast this mutation to all the caches, and they invalidate the values b that depend on the mutable value identified by said uuid, because the writer monad in which each b is wrapped can tell whether that b depends on said uuid or not.

Of course, this only pays off if you read much more often than you write.
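Not Haskell, but a rough Python analogue of the dependency-tracking idea may make it clearer; the Tracked wrapper, combinators, and uuid strings below are all made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Tracked:
    """A computed value tagged with the ids of the mutable values it was built from
    (a poor man's MonadWriter (Set uuid) b)."""
    value: object
    deps: frozenset = field(default_factory=frozenset)

def tracked_map(f, t: Tracked) -> Tracked:
    # Combinator: derived values inherit the dependency set.
    return Tracked(f(t.value), t.deps)

def tracked_combine(f, *ts: Tracked) -> Tracked:
    # Combinator: the result depends on the union of all inputs' dependencies.
    return Tracked(f(*(t.value for t in ts)),
                   frozenset().union(*(t.deps for t in ts)))

# A cache of computed values, invalidated by broadcasting a mutated uuid.
cache: dict[str, Tracked] = {}

def invalidate(mutated_uuid: str) -> None:
    for key in [k for k, t in cache.items() if mutated_uuid in t.deps]:
        del cache[key]

# Usage sketch:
row_a = Tracked(10, frozenset({"row:42"}))   # value read from db row 42
row_b = Tracked(3, frozenset({"row:43"}))
cache["total"] = tracked_combine(lambda x, y: x + y, row_a, row_b)
invalidate("row:42")                          # "total" is dropped from the cache
```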
A third, practical approach is to use materialized views in databases and use them as caches. AFAIK they also aim to solve the invalidation problem. This of course limits the operations that connect the mutable data to the derived data.
I'm working on an approach right now based on PostSharp and memoizing functions. I've run it past my mentor, and he agrees that it's a good implementation of caching in a content-agnostic way.
Every function can be marked with an attribute that specifies its expiry period. Each function marked in this way is memoized and the result is stored into the cache, with a hash of the function call and parameters used as the key. I'm using Velocity for the backend, which handles distribution of the cache data.
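This is not the PostSharp/Velocity implementation, only a small Python sketch of the same idea: a decorator plays the role of the attribute, and the cache key is a hash of the function name and arguments. The in-process dict stands in for the distributed cache backend.

```python
import functools
import hashlib
import pickle
import time

_cache: dict[str, tuple[float, object]] = {}  # key -> (expires_at, value)

def cached(expiry_seconds: float):
    """Memoize a function for a fixed expiry period (illustrative, in-process only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
            ).hexdigest()
            hit = _cache.get(key)
            if hit is not None and hit[0] > time.time():
                return hit[1]                        # still within its expiry period
            value = func(*args, **kwargs)
            _cache[key] = (time.time() + expiry_seconds, value)
            return value
        return wrapper
    return decorator

@cached(expiry_seconds=60)
def expensive_lookup(user_id: int) -> str:
    return f"profile-for-{user_id}"   # stand-in for a slow call
```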
If you're going to getData() every time you do the transform, then you've eliminated the entire benefit of the cache.
For your example, it seems like a solution would be, when you generate the transformed data, to also store the filename and last modified time of the file the data was generated from (you already stored this in whatever data structure was returned by getData(), so you just copy that record into the data structure returned by transformData()), and then when you call transformData() again, check the last modified time of the file.
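A sketch of that suggestion in Python (getData/transformData as in the question; the record layout and the transform are illustrative): the transformed cache carries the source file's name and mtime, so revalidation is a cheap stat rather than a full re-read.

```python
import os

DATA_FILE = "data.txt"        # illustrative path
_transformed = None           # {"filename": ..., "mtime": ..., "result": ...}

def getData() -> str:
    with open(DATA_FILE) as f:
        return f.read()

def transformData() -> str:
    """Reuse the cached transform while the source file's mtime is unchanged."""
    global _transformed
    if _transformed is not None and os.path.getmtime(_transformed["filename"]) == _transformed["mtime"]:
        return _transformed["result"]     # cheap stat, no re-read, no re-transform
    mtime = os.path.getmtime(DATA_FILE)
    data = getData()
    _transformed = {"filename": DATA_FILE, "mtime": mtime, "result": data.upper()}
    return _transformed["result"]
```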
No, because all data is different. Some data may be "stale" after a minute, some after an hour, and some may be fine for days or months.

Regarding your specific example, the simplest solution is to have a 'cache checking' function for files, which you call from both getData and transformData.
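A sketch of such a shared 'cache checking' helper (the names are purely illustrative): each caller remembers the stamp its cache was built from and asks this one function whether that stamp is still current.

```python
import os

def cache_stamp(path: str) -> float:
    """One shared notion of 'version' for file-backed caches: the file's mtime."""
    return os.path.getmtime(path)

def is_fresh(path: str, stamp: float | None) -> bool:
    """Would a cache built when the file had `stamp` still be valid?"""
    return stamp is not None and stamp == cache_stamp(path)

# Both getData and transformData keep the stamp next to their cached value and call:
#   if not is_fresh(DATA_FILE, cached_stamp): rebuild()
```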
There is no general solution, but:

Your cache can act as a proxy (pull). Assume your cache knows the timestamp of the origin's last change. When someone calls getData(), the cache asks the origin for its last-change timestamp; if it is the same, it returns the cached content, otherwise it updates its content from the source and returns that. (A variation is for the client to send the timestamp directly with the request, and the source only returns content if its timestamp is different.)

You can instead use a notification process (push): the cache observes the source, and if the source changes, it sends a notification to the cache, which is then flagged as "dirty". If someone calls getData(), the cache first updates itself from the source and removes the "dirty" flag, then returns its content.

The choice generally speaking depends on usage: many calls to getData() would favour a push, so as to avoid the source being flooded by getTimestamp calls.

Note: as using the timestamp is the traditional way HTTP proxies work, another approach is sharing a hash of the stored content. The only ways I know for two entities to stay updated together are either I call you (pull) or you call me… (push), that's all.
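A compressed sketch of the two variants (pull with a timestamp check vs. push with a dirty flag); the Origin interface here is an assumption for illustration, not something from the answer:

```python
class Origin:
    """Assumed source of truth: exposes its data and a last-modified timestamp."""
    def __init__(self):
        self.timestamp = 0
        self.data = "initial"

    def write(self, data):
        self.data = data
        self.timestamp += 1

class PullCache:
    """Pull: revalidate against the origin's timestamp on every read."""
    def __init__(self, origin: Origin):
        self.origin = origin
        self.timestamp = None
        self.data = None

    def getData(self):
        if self.timestamp != self.origin.timestamp:   # cheap check, not a full fetch
            self.data = self.origin.data
            self.timestamp = self.origin.timestamp
        return self.data

class PushCache:
    """Push: the source notifies the cache, which only marks itself dirty."""
    def __init__(self, origin: Origin):
        self.origin = origin
        self.dirty = True
        self.data = None

    def notify_changed(self):          # to be called whenever the origin changes
        self.dirty = True

    def getData(self):
        if self.dirty:
            self.data = self.origin.data
            self.dirty = False
        return self.data
```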
Caching is hard because you need to consider:

1) the cache may consist of multiple nodes, which need to reach consensus
2) invalidation time
3) race conditions when multiple get/set operations happen concurrently

This is good reading: https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
Perhaps cache-oblivious algorithms would be the most general (or at least, the least dependent on the hardware configuration), since they use the fastest cache first and move on from there. Here's an MIT lecture on it: Cache Oblivious Algorithms.
What you are talking about is lifetime dependency chaining: one thing is dependent on another which can be modified outside of its control.

If you have an idempotent function from a, b to c where, if a and b are the same then c is the same, but the cost of checking b is high, then you either:

1. accept that you sometimes operate with out-of-date data and do not check b every time, or
2. do your best to make checking b as fast as possible.

You cannot have your cake and eat it...

If you can layer an additional cache based on a over the top, then this affects the initial problem not one bit. If you chose 1 then you have whatever freedom you gave yourself and can thus cache more, but must remember to consider the validity of the cached value of b. If you chose 2 you must still check b every time, but can fall back on the cache for a if b checks out.

If you layer caches, you must consider whether you have violated the 'rules' of the system as a result of the combined behaviour.

If you know that a always has validity if b does, then you can arrange your cache like so (pseudocode):
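The pseudocode block from the original answer did not survive in this copy; below is a rough Python sketch of the arrangement it describes (cache keyed on a, revalidated against the costly input b), not the author's original code:

```python
class LayeredCache:
    """Cache c = compute(a, b), keyed on a, revalidated against the costly input b."""
    def __init__(self, compute, version_of_b):
        self.compute = compute            # idempotent: same (a, b) -> same c
        self.version_of_b = version_of_b  # the expensive check on b, made explicit
        self.by_a = {}                    # a -> (b_version_used, c)

    def get(self, a, b):
        b_version = self.version_of_b(b)  # the single validity check on b
        hit = self.by_a.get(a)
        if hit is not None and hit[0] == b_version:
            return hit[1]                 # same a and b checks out: reuse cached c
        c = self.compute(a, b)
        self.by_a[a] = (b_version, c)
        return c
```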
Obviously successive layering (say with x) is trivial so long as, at each stage, the validity of the newly added input matches the a:b relationship for x:b and x:a.

However it is quite possible that you could get three inputs whose validity was entirely independent (or was cyclic), so no layering would be possible. This would mean the line marked // important would have to change accordingly.
The problem in cache invalidation is that stuff changes without us knowing about it. So, in some cases, a solution is possible if there is some other thing that does know about it and can notify us. In the given example, the getData function could hook into the file system, which does know about all changes to files, regardless of what process changes the file, and this component in turn could notify the component that transforms the data.
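For illustration, a sketch of this push-based idea using the third-party watchdog package (the package choice, paths, and cache flags are my assumptions, not the answer's): the file-system watcher invalidates both the data cache and the dependent transformed cache whenever the file changes, so nothing has to poll.

```python
# pip install watchdog
import threading
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

DATA_FILE = "data.txt"                 # illustrative path
_lock = threading.Lock()
_caches_valid = {"data": False, "transformed": False}

class InvalidateOnChange(FileSystemEventHandler):
    """The file system tells us about changes, so the caches never poll."""
    def on_modified(self, event):
        if event.src_path.endswith(DATA_FILE):
            with _lock:
                _caches_valid["data"] = False
                _caches_valid["transformed"] = False   # dependent cache invalidated too

observer = Observer()
observer.schedule(InvalidateOnChange(), path=".", recursive=False)
observer.start()
```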
I don't think there is any general magic fix to make the problem go away. But in many practical cases there may very well be opportunities to transform a "polling"-based approach into an "interrupt"-based one, which can make the problem simply go away.