ScyllaDB read latency increases every 15 days
Usually the read latency is around 3-4 ms, but every 15-20 days the latency shoots up to 100-150 ms, and because of this I need to restart the whole cluster. The read repair runs every midnight. I am unable to figure out the issue.
I have an 8-node Scylla cluster (version 4.1). All nodes are within the same data center, and I have 5 keyspaces.
In 3 of the 5 keyspaces I am storing JSON blobs after compressing them, and I am querying with CL = LOCAL_QUORUM for both reads and writes. For these keyspaces the read and write rates are roughly the same, and a 6-month TTL is set for each inserted record. Each keyspace has only 1 table.
For the other 2 keyspaces I am storing some internal configs; I write to the tables using CL = ALL and read from them using CL = LOCAL_ONE. The read-to-write ratio is 10:1 and no TTL is set for inserted records. Each keyspace has around 5-8 tables.
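For illustration only (a rough sketch; the keyspace, table, and column names below are assumptions, not from the original post), the two access patterns would look roughly like this in cqlsh, with the 6-month TTL expressed as ~15552000 seconds:

-- blob keyspaces: reads and writes at LOCAL_QUORUM, 6-month TTL on each insert
CONSISTENCY LOCAL_QUORUM;
INSERT INTO blob_ks.blobs (id, payload) VALUES ('some-key', 0xCAFEBABE) USING TTL 15552000;
SELECT payload FROM blob_ks.blobs WHERE id = 'some-key';

-- config keyspaces: writes at ALL, reads at LOCAL_ONE, no TTL
CONSISTENCY ALL;
INSERT INTO config_ks.settings (name, value) VALUES ('feature_flag', 'on');
CONSISTENCY LOCAL_ONE;
SELECT value FROM config_ks.settings WHERE name = 'feature_flag';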
Below is the configuration of all tables/keyspaces:
Replication factor = 3,
compaction = {'class': 'SizeTieredCompactionStrategy'},
compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'},
crc_check_chance = 1.0,
dclocal_read_repair_chance = 0.1,
default_time_to_live = 0,
gc_grace_seconds = 864000,
max_index_interval = 2048,
memtable_flush_period_in_ms = 0,
min_index_interval = 128,
read_repair_chance = 0.0,
speculative_retry = '99.0PERCENTILE'
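As a point of reference (the keyspace name ks1, table name blobs, column layout, and datacenter name dc1 are placeholders I am assuming, not taken from the post), these settings map onto CQL roughly as follows, with the replication factor set at the keyspace level and the remaining options at the table level:

CREATE KEYSPACE ks1
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

CREATE TABLE ks1.blobs (
  id text PRIMARY KEY,
  payload blob
) WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
  AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
  AND crc_check_chance = 1.0
  AND dclocal_read_repair_chance = 0.1
  AND default_time_to_live = 0
  AND gc_grace_seconds = 864000
  AND max_index_interval = 2048
  AND memtable_flush_period_in_ms = 0
  AND min_index_interval = 128
  AND read_repair_chance = 0.0
  AND speculative_retry = '99.0PERCENTILE';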
In the graph below: I restarted the cluster on 5th May. From 5th May till 20th May the read latency was around 3-4 ms, but from 20th May onwards it started increasing exponentially, and on the 23rd, when it reached 100-150 ms, I restarted the cluster and the read latency went back to normal.
Looking at the graph below, I suspect the increase in latency might be due to reads going to disk in that period, but the volume of disk reads is quite small.
1 Answer
Read repair runs all the time: for every operation, if there is a discrepancy between the replicas, read repair syncs the stale replica, on a per-partition basis.
There is not enough data here to figure it out; it is probably compaction. Best to ask on the mailing list or Slack, where you can upload the compaction and the reactor graphs.
Also, 4.1 is an obsolete version - move to 4.6.