Is Cassandra suitable for aggregate queries?
I have read that columnar databases are well suited to aggregate queries and that Cassandra is a columnar database. I am trying to use count() with a 'between' or '>=' condition on values within a specific partition in Cassandra. Is this performance intensive?
It's a common misconception that Cassandra is a columnar database. I think it comes from the old terminology "column family" for tables. Data is stored in rows containing columns of key-value pairs, which is why tables used to be called column families.
A major difference compared to traditional relational databases is that Cassandra tables can be 2-dimensional (each record contains exactly one row) or multi-dimensional (each record can contain ONE OR MORE rows).
Columnar databases, on the other hand, flip a 2-dimensional table so that data is stored in columns instead of rows, which is specifically optimised for analytics-type queries such as aggregations. This is NOT how Cassandra works.
Going back to your question, counting the rows within a single partition is OK for most data models. The key is to restrict the query to just one partition.
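The original query is not reproduced here, so as a minimal sketch, assume a hypothetical table sensor_readings partitioned by sensor_id (the table and column names are invented for illustration). Counting the rows in one partition could look something like this:

```cql
-- Hypothetical schema, for illustration only.
-- sensor_id is the partition key; reading_time is the clustering column.
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
);

-- The equality filter on the partition key confines the count to one partition.
SELECT COUNT(*)
FROM sensor_readings
WHERE sensor_id = 'sensor-42';
```

Because the WHERE clause names the full partition key, the coordinator only has to ask the replicas that own that one partition. The cost still grows with the number of rows in the partition.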
It's also OK to count the rows in a range query, as long as the range is restricted to one partition.
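As a rough sketch of such a range count, assume a hypothetical table sensor_readings whose partition key is sensor_id and whose clustering column is reading_time of type timestamp (both names are made up for this example):

```cql
-- Assumed (hypothetical) primary key: PRIMARY KEY ((sensor_id), reading_time)
-- The partition key is fixed with '=', and the range applies only to the
-- clustering column, so the query reads one contiguous slice of one partition.
SELECT COUNT(*)
FROM sensor_readings
WHERE sensor_id = 'sensor-42'
  AND reading_time >= '2023-01-01'
  AND reading_time <  '2023-02-01';
```

If the range is on a regular (non-clustering) column instead, Cassandra would typically require ALLOW FILTERING, and the whole partition would be read before rows are discarded.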
If you don't restrict the query to a single partition, it might work for (a) very small datasets and (b) clusters with a very low number of nodes, but it doesn't scale as (c) the dataset grows and (d) the number of nodes increases. I've explained why performing aggregates such as COUNT() is bad in Cassandra in this post (https://community.datastax.com/questions/6897/).

This is not to say that Cassandra isn't a good fit. Cassandra is a good choice if your primary use case is storing real-time data for OLTP workloads. For analytics queries, you just need to use other software like Apache Spark, since the spark-cassandra-connector will optimise the queries to Cassandra. Cheers!
Cassandra is a partitioned row store. Data is stored in partitions, clustered together and served as "rows." It is not a columnar database.
An aggregate query that runs a count will not perform well on Cassandra. Attempting it will be performance intensive, right up until the coordinator node times out the query.
If this is a use case you need to solve for, another database will be the better option.
Adding to @aaron's response, if you're performing an aggregate operation just within a single partition, that might be okay. For example, given a table schema with a suitable partition key, aggregation queries that are restricted to one partition may be performant.
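The schema and queries from the original answer are not shown here, so the following is only an illustrative sketch. The table orders_by_customer and its columns are hypothetical; COUNT, SUM, AVG and MAX are CQL's built-in aggregate functions.

```cql
-- Hypothetical schema: all orders for a customer live in one partition.
CREATE TABLE IF NOT EXISTS orders_by_customer (
    customer_id uuid,
    order_id    timeuuid,
    amount      decimal,
    PRIMARY KEY ((customer_id), order_id)
);

-- Aggregates computed over a single partition only.
SELECT COUNT(*), SUM(amount), AVG(amount), MAX(amount)
FROM orders_by_customer
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;
```

The same statements without the WHERE clause on customer_id would scan every partition in the cluster, which is the case the other answers warn about.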