Efficiency of querying 10 billion rows (with high cardinality) in ScyllaDB
Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:
PK1 PK2 PK3 V1 V2
Where PK represents a partition key column and V represents a value. So in the above example, the partition key consists of 3 columns.
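For concreteness, a minimal CQL sketch of a schema matching this description might look like the following (the keyspace, table name, and column types are assumptions for illustration, not something given in the question):

-- illustrative only: assumes an existing keyspace named my_keyspace
CREATE TABLE my_keyspace.example_table (
    pk1 text,
    pk2 text,
    pk3 text,
    v1  text,
    v2  text,
    PRIMARY KEY ((pk1, pk2, pk3))   -- composite partition key of 3 columns, no clustering columns
);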
Scylla requires that you specify all columns of the partition key in the WHERE clause.
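For example, a lookup that supplies all three partition key columns can be hashed and routed directly to the replicas owning that partition (X, Y and Z stand in for concrete values):

SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y AND PK3 = Z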
If you want to execute a query that specifies only some of those columns, you'd get a warning, as this requires a full table scan:
SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y
In the above query, we specify only 2 of the 3 partition key columns. Suppose the query matches 1 billion out of 10 billion rows - what is a good mental model for reasoning about the cost/performance of this query?
My assumption is that the cost is high: it is equivalent to executing ten billion separate queries on the data set, since 1) there is no logical association between the rows in the way they are stored on disk, as each row has a different partition key (high cardinality), and 2) in order for Scylla to determine which rows match the query, it has to scan all 10 billion rows (even though the result set only contains 1 billion rows).
Assuming a single server can process 100K transactions per second (well within the range advertised by the ScyllaDB folks) and the data resides on 100 servers, the (estimated) time to process this query can be calculated as: 100K * 100 = 10 million queries per second. 10 billion divided by 10M = 1,000 seconds. So it would take the cluster approximately 1,000 seconds to process the query (consuming all of the cluster's resources).
Is this correct? Or is there a flaw in my mental model of how Scylla processes such queries?
Thanks
2 Answers
As you suggested yourself, Scylla (and everything I say in my answer also applies to Cassandra) keeps the partitions hashed by the full partition key - all three columns. So Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions and check, for each one, whether its partition key matches the request.
However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data itself isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. There's a lot of work that goes into a random-access read - the request needs to reach a coordinator, which then sends it to replicas, each replica needs to find the specific position in its on-disk data files (often multiple files), often needs to over-read from the disk (as disk and compression alignment require), and so on. Compare this to a scan, which can read long contiguous swathes of data, sorted by token (partition-key hash), from disk, and can return many rows fairly quickly with fewer I/O operations and less CPU work.
So if your example setup can do 100,000 random-access reads per node, it can probably read a lot more than 100,000 rows per second during a scan. I don't know what exact number to give you, but in the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's 12 million rows per second on each node, instead of your estimate of 100,000. So your example use case could potentially finish in just 8.3 seconds, instead of the 1,000 seconds you calculated.
Finally, please don't forget (and this is also mentioned in the aforementioned blog post) that if you do a large scan, you should explicitly parallelize it, i.e., split the token range into pieces and scan them in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more or less unavoidable. Second, scanning returns partitions in token order, and (as I explained above) they sit contiguously on individual replicas - which is great for peak throughput, but also means that only one node (or even one CPU) will be active at any time during the scan. So it's important to split the scan into pieces and run all of them in parallel. We also have a blog post about the importance of parallel scans, and how to do them: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.
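As a rough sketch of what such a parallelized scan looks like at the CQL level (the sub-range boundaries below are arbitrary placeholders; in practice a client library or tool splits the full Murmur3 token range, ideally along the cluster's ownership boundaries, and issues the pieces concurrently):

-- each sub-range query can be issued from a different client thread or connection
SELECT PK1, PK2, PK3, V1, V2 FROM table
    WHERE token(PK1, PK2, PK3) >= -9223372036854775808
      AND token(PK1, PK2, PK3) < -4611686018427387904;

SELECT PK1, PK2, PK3, V1, V2 FROM table
    WHERE token(PK1, PK2, PK3) >= -4611686018427387904
      AND token(PK1, PK2, PK3) < 0;

-- ...and so on, until the whole token range (-2^63 .. 2^63 - 1) is covered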
Another option is to move one of the PK columns to become a clustering key; this way, if you have the first two PKs, you'll be able to locate the partition and just search within it.
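A rough sketch of that alternative (reusing the assumed names from the schema sketch above): pk3 becomes a clustering column, so a query that supplies pk1 and pk2 reads a single partition, and pk3 can optionally narrow it further:

CREATE TABLE my_keyspace.example_table (
    pk1 text,
    pk2 text,
    pk3 text,
    v1  text,
    v2  text,
    PRIMARY KEY ((pk1, pk2), pk3)   -- partition key (pk1, pk2), clustering column pk3
);

SELECT v1, v2 FROM my_keyspace.example_table WHERE pk1 = X AND pk2 = Y;              -- single-partition read
SELECT v1, v2 FROM my_keyspace.example_table WHERE pk1 = X AND pk2 = Y AND pk3 = Z;  -- narrows to one row within it

The trade-off is that every row sharing the same (pk1, pk2) pair now lives in one partition, so this only works well if that combination is selective enough to keep partitions reasonably small.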