Partition keys & data modeling in ScyllaDB
In Scylla, data is stored by partition key. If I query a large table with many partition keys, is it equivalent to executing multiple queries against the table? For example, suppose I have the following table:
key1 : val1
key2 : val2
key3 : val3
Where each of the 3 keys (key1..3) is a different partition key.
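(For concreteness, here is a minimal sketch of what such a table could look like, written with the Python driver; the keyspace/table names demo.kv and the contact point are illustrative assumptions, not part of the original question. With only a partition key in the primary key, each of key1..key3 is its own partition.)

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])  # assumed contact point
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    # "key text PRIMARY KEY" means every distinct key is its own partition.
    session.execute("CREATE TABLE IF NOT EXISTS demo.kv (key text PRIMARY KEY, val text)")

    for k, v in [('key1', 'val1'), ('key2', 'val2'), ('key3', 'val3')]:
        session.execute("INSERT INTO demo.kv (key, val) VALUES (%s, %s)", (k, v))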
If I execute the following query against the table:
SELECT * FROM table
Scylla will presumably need to execute this query 3 times - on 3 different partitions, since each row is stored on a different partition. That seems inefficient, as it means the query will be executed once per partition. Suppose the data were partitioned into 100 partitions (100 keys); would the query need to be executed 100 times to complete? (And, by extension, would the query only be as fast as the slowest server?)
If this is true, then querying 1 row from 3 separate tables (e.g., where each row has a different partition key) should have identical performance to querying 3 rows from one table where each of the 3 rows has a different partition key? In other words, whether the data is modeled as one table or as multiple tables doesn't really matter; what matters is whether two or more rows share the same partition key?
What happens when we query 3 different tables where each row has the same partition key? Is this as efficient as querying 3 rows from one table where all of the rows have the same partition key?
Any guidance in evaluating performance expectations in the 3 scenarios described above would be very helpful.
Thanks!
2 Answers
As you noted, the query
SELECT * FROM table
is not a query in an individual partition, but rather a whole-table scan. A whole-table scan is "costly" in the sense that it will need to read all the data in the table (if you run it to completion), but it is not as inefficient as you thought it might be: Scylla and Cassandra do not begin such a query by looking up the list of extant partition keys and then querying each of them individually. Instead, Scylla and Cassandra keep partition keys in a deterministic order, so-called "token" order (you can think of a partition key's "token" as the output of a hash function applied to that key). Individual server nodes hold contiguous ranges of these tokens, so scanning the entire table is achieved by scanning each of these contiguous token ranges (also called "vnodes") - each of which an individual node serves efficiently by reading data sequentially from its own disk. So you can have a million or even a billion partitions, and
SELECT * FROM table
for reading the entire table will still involve mostly-sequential reads from disk - not a million or billion seeks to individual partitions.

Another comment I feel compelled to make is that if you are thinking about having just 3 partitions, and worrying about increasing the number to 100, you are misunderstanding data modeling in Scylla (and Cassandra). In fact, 100 partitions is still too few. You should have far more than 100 partitions; the more, the better. The reason is that if you have only a few huge partitions, the data will not be evenly distributed between nodes and shards (CPUs). If you have just 3 partitions and 100 CPUs, then since each partition is owned by one CPU (in Cassandra, one server), only 3 of the 100 CPUs will be doing any work, which is certainly not a good idea. Having a million partitions is much better than having just 3.
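To make the token-range idea concrete, here is a minimal sketch using the Python driver; the contact point, the demo.kv table from the question's sketch, and the slice width are illustrative assumptions. It walks the token ring in contiguous slices via CQL's token() function - the same ordering the cluster's own range scans rely on:

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])   # assumed contact point
    session = cluster.connect('demo')  # assumed keyspace

    # Murmur3 tokens span (roughly) the signed 64-bit range; walking it in
    # contiguous slices mirrors how a full scan proceeds range by range.
    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
    STEP = 2**60  # arbitrary slice width, for illustration only

    stmt = session.prepare(
        "SELECT key, val FROM kv WHERE token(key) >= ? AND token(key) <= ?")

    start = MIN_TOKEN
    while start <= MAX_TOKEN:
        end = min(start + STEP - 1, MAX_TOKEN)
        for row in session.execute(stmt, (start, end)):
            print(row.key, row.val)
        start = end + 1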
In the future, please try to ask only one question per question.
Without the ability to determine an exact partition, the driver will choose a node in the cluster to send the query to. This node becomes the "coordinator" for the query. It then sends requests out to each node in the cluster and builds the result set. Once complete, the coordinator returns the result set to the driver. In this particular case, it has to poll all nodes in the cluster to find only 3 rows...not terribly efficient.
This is why unbound queries really aren't a good idea in the Cassandra/Scylla world, as that one node becomes responsible for polling the data from all of the other nodes. In a large-cluster, large-data scenario, it is not unheard of for a node acting as coordinator to slow down, or even crash.
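If you do have to run an unbound scan, paging at least stops the coordinator from having to hand back everything in one response. A sketch with the Python driver, reusing the assumed demo.kv table from above:

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['127.0.0.1'])   # assumed contact point
    session = cluster.connect('demo')  # assumed keyspace

    # fetch_size caps the rows returned per page; the driver fetches
    # further pages transparently as the result set is iterated.
    stmt = SimpleStatement("SELECT key, val FROM kv", fetch_size=100)
    for row in session.execute(stmt):
        print(row.key, row.val)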
I assume from reading this that a partition key is being provided as part of each query's WHERE clause. Querying a single, specific row from 3 separate tables will be faster. Basically, it won't need to exhaustively check every node in the cluster. The driver can simply hash the three partition keys and know exactly where to go to get the data. The three queries will perform even faster in this scenario if the driver is using a token-aware load balancing policy, as there is no need for a single node to act as coordinator, skipping one network trip.

This will perform similarly to the previous scenario, where three distinct queries will be run. The fact that they all go to the same partition shouldn't make much of a difference, other than that the same nodes will be used to serve the data.
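For the single-partition scenarios, here is a sketch of token-aware reads with the Python driver (the contact point and the demo.kv table are assumptions). Prepared statements matter here: they carry the partition-key metadata the token-aware policy needs to route each query straight to a replica:

    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # TokenAwarePolicy hashes each bound partition key and prefers a replica
    # that owns it, skipping the extra coordinator hop described above.
    profile = ExecutionProfile(
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))
    cluster = Cluster(['127.0.0.1'],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect('demo')

    stmt = session.prepare("SELECT val FROM kv WHERE key = ?")
    for key in ('key1', 'key2', 'key3'):
        row = session.execute(stmt, (key,)).one()
        print(key, row.val if row else None)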
For additional reference, here's a link to a Fault Tolerance diagram from Scylla's docs. It provides more visual detail on the paths of read and write operations, as well as the effects of replication factor, consistency level, and multiple nodes.