How does Scylla evict data from its cache?

Posted on 2025-01-29 07:47:18

How does Scylla determine when to evict data from its cache? For example, suppose table T has the following structure:

K1 C1 V1 V2 V3

I populate the above table with 500 rows (e.g., the query SELECT * FROM T WHERE K1 = X AND C1 = Y returns 500 rows).

Some time later I insert a new row into the above table that would cause the above query to return 501 rows, instead of 500 rows.

Does Scylla know to automatically evict the 500 rows from its cache, or at least to add row 501 to its cache? If not, most queries will quickly start returning outdated data. Similarly, what happens if I don’t add a new row but instead update one of the existing 500 rows? Is Scylla aware of this modification and capable of updating its cache automatically? If yes, is it smart enough to update only the data that changed (the new row or the row that was modified), or does it evict/update all 500 rows?

Are there any cases to be aware of where data is updated in SSTables but not in memory?

Thanks

P.S

I read a lot about how caching works in Scylla but I didn’t see a clear answer to the above question. If Scylla is indeed aware of background updates I would also be curious to learn HOW it achieves such dynamic and intelligent updating of its cache.

奈何桥上唱咆哮 2025-02-05 07:47:18

I think you are misunderstanding what the cache does in Scylla, or any database for that matter.

The row cache, as its name suggests, caches (i.e., keeps in memory) individual rows - not the results of entire requests. So the fact that a request at one point returned 500 rows does not mean that Scylla will return the same 500 rows the next time this request arrives. Not at all. Let me try to explain what does happen, although this is also documented elsewhere and I'll simplify some details to hopefully get the point across:

When a Scylla node boots up, all the data is located on disk (stored in files known as sstables) and nothing is in memory. When a user asks to read one specific row that is not already in the in-memory cache, this row is read from disk and then stored in the cache. If the user later reads the same row again, it is returned from the cache immediately. If the user writes to this row, the row is updated in the cache as well as on disk (the details are slightly more complicated: there is also an in-memory table, the memtable, but I'm trying to simplify). The cache is always up-to-date - if a row appears in it, it is correct. Of course, a row may also be absent from the cache.
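To make the read and write paths above concrete, here is a minimal Python sketch of a read-through row cache. It is a toy model of the simplified description only, not Scylla's implementation: the class `ToyRowCache`, its `disk` dict (standing in for sstables), and the `disk_reads` counter are all hypothetical names, and the memtable is left out entirely.

```python
class ToyRowCache:
    """Toy model: a dict standing in for sstables plus an in-memory row cache."""

    def __init__(self, disk):
        self.disk = dict(disk)   # stands in for the sstables on disk
        self.cache = {}          # rows currently held in memory
        self.disk_reads = 0      # how many reads had to go to disk

    def read(self, key):
        if key in self.cache:            # cache hit: no disk access
            return self.cache[key]
        self.disk_reads += 1             # cache miss: read from disk ...
        row = self.disk.get(key)
        if row is not None:
            self.cache[key] = row        # ... and remember the row
        return row

    def write(self, key, row):
        self.disk[key] = row             # the write is persisted
        if key in self.cache:
            self.cache[key] = row        # a cached copy is never left stale


store = ToyRowCache({("x", 1): "v1"})
store.read(("x", 1))            # first read goes to disk
store.read(("x", 1))            # second read is served from memory
store.write(("x", 1), "v2")     # the update reaches both disk and cache
latest = store.read(("x", 1))   # still served from memory, and up to date
```

The point of the sketch is the invariant in `write`: a row that is present in the cache is updated together with the persisted copy, so a cached row can be missing but never stale.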

The situation you describe in your question's text (although not the actual query you posted!) is about a scan of a slice of a partition, returning not one but many rows (500 or 501). Scylla needs to (and does) put in a bit more work to handle this case correctly:

When the scan of a certain range is done for the first time, Scylla reads the 500 rows in that range and puts each of them in the row cache. But it also remembers that the cache is contiguous in that range - these 500 rows are everything that exists in this range. So when the user tries the same query again, the cache doesn't need to check whether there are additional rows between those 500 - it knows there aren't. If you later write a 501st row inside this range, this row is added to the cache, and the cache knows the range is still contiguous, so the next scan of this range will return 501 rows. Scylla does not need to evict the 500 rows just because one was added to the same partition.

If at some later point in time Scylla runs out of memory and needs to evict some rows from the cache, it may decide to evict all these 501 rows from the cache - or only some of them. If it evicts only some of them, it loses continuity: if it remembers, say, only 400 rows of the original range, then the next time the user scans that range Scylla is forced (again, simplifying some details) to read all the rows in the range from disk, because it has no idea which specific rows are missing from this range.
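The range-scan behaviour can be sketched in the same toy style. The model below keeps one "continuous" flag for a whole partition range, which is a deliberate simplification; the class `ToyPartitionCache`, its `evict` method, and the choice of which rows get evicted are assumptions made for illustration and do not describe how Scylla actually tracks continuity or picks eviction victims.

```python
class ToyPartitionCache:
    """Toy model of the cached rows of one partition range plus a continuity flag."""

    def __init__(self, disk_rows):
        self.disk = dict(disk_rows)  # clustering key -> row; stands in for sstables
        self.cached = {}             # clustering key -> row held in memory
        self.continuous = False      # True: `cached` holds every row of the range
        self.disk_scans = 0          # how many scans had to go to disk

    def scan(self):
        if not self.continuous:
            # Gaps are possible, so re-read the whole range from disk.
            self.disk_scans += 1
            self.cached = dict(self.disk)
            self.continuous = True
        return [self.cached[key] for key in sorted(self.cached)]

    def write(self, key, row):
        self.disk[key] = row
        if self.continuous or key in self.cached:
            # Adding the row keeps a continuous range continuous.
            self.cached[key] = row

    def evict(self, count):
        # Simulate memory pressure by dropping up to `count` cached rows.
        victims = sorted(self.cached)[:count]
        for key in victims:
            del self.cached[key]
        if victims:
            self.continuous = False  # the cache no longer knows what is missing


part = ToyPartitionCache({c: f"row{c}" for c in range(500)})
first = part.scan()            # 500 rows, read from disk
part.write(500, "row500")      # row 501 lands inside the cached range
second = part.scan()           # 501 rows, served from memory
part.evict(101)                # only 400 rows remain cached
third = part.scan()            # 501 rows again, but re-read from disk
```

In this sketch the write of row 501 never invalidates the other 500 rows, and only the eviction forces the second trip to disk, which mirrors the two cases described above.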
