What is SSTable overlap in Cassandra TWCS?

Posted on 2025-01-12 10:21:31


I am trying to understand SSTable overlaps in Cassandra, which are said to be a problem for TWCS. I found references like https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html but I still don't understand what overlap means and how it is caused by read repairs. Can anyone please provide a simple example that would help me understand? Thanks


Comments (1)

无名指的心愿 2025-01-19 10:21:31

For TWCS, data is compacted into "time windows". If you've configured a time window of 1 hour, TWCS will compact (combine) all partitions written within a one-hour window into a single SSTable. Over a 24-hour period you will end up with 24 SSTables, one for each hour of the day.
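To make the windowing concrete, here is a minimal Python sketch of the bucketing idea only. It is not Cassandra's implementation: `window_start`, `bucket_writes`, and the sample write times are hypothetical names and data chosen for illustration, and the 1-hour window simply mirrors the example above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # assumed window size, matching the 1-hour example
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime, window: timedelta = WINDOW) -> datetime:
    """Round a write timestamp down to the start of its time window."""
    return EPOCH + ((ts - EPOCH) // window) * window


def bucket_writes(write_times):
    """Group write timestamps by the window they fall into.

    Each bucket stands in for the single SSTable that the window
    would eventually be compacted into (illustrative model only).
    """
    buckets = defaultdict(list)
    for ts in write_times:
        buckets[window_start(ts)].append(ts)
    return dict(buckets)


# Hypothetical workload: one write every 20 minutes for a full day.
day = datetime(2025, 1, 12)
writes = [day + timedelta(minutes=m) for m in range(0, 24 * 60, 20)]
buckets = bucket_writes(writes)

print(len(buckets))  # 24 windows, i.e. 24 "SSTables" for the day
```

Every timestamp in a bucket lies inside that bucket's hour, which is the property the rest of the answer relies on.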

Let's say you inspect the SSTable generated at 9am. The minimum and maximum [write] timestamps in that SSTable would both fall between 8am and 9am.

Now consider a scenario where a replica has missed a few mutations (writes) around 10am. All the writes it did receive between 10am and 11am will get compacted into one SSTable. If a repair runs at 3pm, the missed mutations from earlier that day will get included in the 3pm to 4pm time window, even though they really belong in the SSTable for the 10am to 11am window.
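The scenario can be sketched in a few lines of Python. This is a toy model, not Cassandra code: the `SSTable` class, `flush`, `overlaps`, and all timestamps are hypothetical, and it assumes the repaired mutation is flushed together with the writes arriving at 3pm while keeping its original 10:15 write timestamp.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SSTable:
    """Toy SSTable that only tracks its write-timestamp range."""
    name: str
    min_ts: datetime
    max_ts: datetime


def flush(name, write_timestamps):
    """Build a toy SSTable from the write timestamps it contains."""
    return SSTable(name, min(write_timestamps), max(write_timestamps))


def overlaps(a: SSTable, b: SSTable) -> bool:
    """True when the two write-timestamp ranges intersect."""
    return a.min_ts <= b.max_ts and b.min_ts <= a.max_ts


def at(hour, minute=0):
    return datetime(2025, 1, 12, hour, minute)


morning = flush("10am-11am", [at(10, 5), at(10, 30), at(10, 55)])

# Without repair, the afternoon SSTable only holds afternoon timestamps.
clean_afternoon = flush("3pm-4pm", [at(15, 5), at(15, 40)])

# With repair, a mutation originally written at 10:15 lands in the same SSTable.
repaired_afternoon = flush("3pm-4pm", [at(15, 5), at(15, 40), at(10, 15)])

print(overlaps(morning, clean_afternoon))     # False
print(overlaps(morning, repaired_afternoon))  # True
```

The single repaired mutation drags the afternoon SSTable's minimum timestamp back to 10:15, so its timestamp range now reaches into the morning window. That is what "overlap" means here.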

In TWCS, SSTables from different time windows will not get compacted together. This means that the data written around 10am is now fragmented across 2 SSTables in 2 different time windows. Even if the 11am SSTable (the one for the 10am to 11am window) has expired, it cannot be dropped (deleted) from disk because there is data in the 4pm SSTable (the 3pm to 4pm window) that overlaps with it. The 11am SSTable will not get dropped until all the data in the 4pm SSTable has also expired.
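The drop rule described above can be approximated as follows. This is a deliberately simplified model under stated assumptions, not Cassandra's actual check: the 6-hour `TTL`, the `SSTable` class, and `can_drop` are hypothetical, gc_grace_seconds is ignored, and every SSTable is assumed to cover the same partitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TTL = timedelta(hours=6)  # hypothetical TTL applied to every write


@dataclass(frozen=True)
class SSTable:
    name: str
    min_ts: datetime  # oldest write timestamp in the SSTable
    max_ts: datetime  # newest write timestamp in the SSTable

    def fully_expired(self, now: datetime) -> bool:
        # Expired once even the newest write is past its TTL.
        return self.max_ts + TTL <= now


def can_drop(candidate: SSTable, others, now: datetime) -> bool:
    """Simplified rule: drop only if the candidate is fully expired and
    no live SSTable holds data as old as the candidate's newest write."""
    if not candidate.fully_expired(now):
        return False
    for other in others:
        if not other.fully_expired(now) and other.min_ts <= candidate.max_ts:
            return False  # blocked by an overlapping, still-live SSTable
    return True


def at(hour, minute=0):
    return datetime(2025, 1, 12, hour, minute)


morning = SSTable("10am-11am", at(10, 5), at(10, 55))
clean_afternoon = SSTable("3pm-4pm", at(15, 5), at(15, 40))
repaired_afternoon = SSTable("3pm-4pm", at(10, 15), at(15, 40))

print(can_drop(morning, [clean_afternoon], at(17)))     # True
print(can_drop(morning, [repaired_afternoon], at(17)))  # False
print(can_drop(morning, [repaired_afternoon], at(22)))  # True
```

In this model the morning SSTable expires at 16:55, but with the repaired afternoon SSTable present it stays on disk until that one expires too, at 21:40.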

There's a simplified explanation of how TWCS works in "How data is maintained in Cassandra". It includes a nice diagram which should make it easier to visualise how data can overlap across SSTables. Cheers!
