Selecting the row with the highest timestamp

Posted on 2025-01-11 03:26:38

I have a table that stores events

CREATE TABLE active_events (
    event_id VARCHAR,
    number VARCHAR,
....
    start_time TIMESTAMP,
    PRIMARY KEY (event_id, number)
);

Now I want to select the event with the highest start_time. Is it possible? I've tried creating a secondary index, but without success.

This is a query I've created

select * from active_events order by start_time desc limit 1

But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?

Comments (2)

述情 2025-01-18 03:26:38

This is an anti-pattern in Cassandra. To order the data, you need to read all of it and find the highest value. That requires scanning data on multiple nodes and will take a very long time.

A materialized view won't help much either, because data is ordered only within an individual partition. You would have to put all your data into a single partition, which could be huge and would leave the data imbalanced.

I can only think of the following workaround:

  • Have an additional table that has all the columns of the original table, but with a fake partition key and no clustering columns.

  • Insert into that table in parallel with the normal inserts, always using the same fixed value for the fake partition key, and explicitly set the write timestamp of the record to start_time (don't forget to multiply by 1000, since the write timestamp is in microseconds). The table is then guaranteed to hold the row with the highest timestamp, because Cassandra won't overwrite it with data that has a lower timestamp.

But this doesn't solve the data skew problem: all traffic for that table will be handled by a fixed number of nodes, equal to the replication factor (RF).
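
The workaround above could be sketched in CQL roughly as follows. The table name latest_event, the column bucket, and the sample values are hypothetical and only for illustration; the columns elided in the original table are left out rather than guessed.

```sql
-- Hypothetical helper table: one fixed partition and no clustering columns,
-- so it only ever holds a single row.
CREATE TABLE latest_event (
    bucket INT,              -- fake partition key, always written as 0
    event_id VARCHAR,
    number VARCHAR,
    start_time TIMESTAMP,
    PRIMARY KEY (bucket)
);

-- Issued alongside the normal insert into active_events.
-- start_time is given in milliseconds since the epoch; the write timestamp
-- is the same instant in microseconds (start_time * 1000).
INSERT INTO latest_event (bucket, event_id, number, start_time)
VALUES (0, 'e1', '42', 1736565998000)
USING TIMESTAMP 1736565998000000;

-- Reads the single surviving row, i.e. the event written with the
-- highest write timestamp.
SELECT event_id, number, start_time FROM latest_event WHERE bucket = 0;
```

Because every column of the row is written with the same explicit timestamp, a later insert carrying an older start_time is silently discarded in favor of the existing cells.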

Another alternative: use another database.

贱人配狗天长地久 2025-01-18 03:26:38

This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions, each with thousands of rows, spread across hundreds of nodes. A full table scan in a large cluster would take a very long time if it were allowed.

The error:

ORDER BY is only supported when the partition key is restricted by an EQ or an IN

is returned because you can only sort the results when (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!
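
To make conditions (a) and (b) concrete, here is a small sketch. The first query runs against the active_events table from the question; the second table, events_by_day, with its day bucket, is a hypothetical layout that is not part of the original schema and is shown only to illustrate ordering by a clustering column.

```sql
-- (a) partition key restricted with EQ, (b) ordering on the clustering
-- column `number`: this form of ORDER BY is accepted.
SELECT * FROM active_events
WHERE event_id = 'e1'
ORDER BY number DESC
LIMIT 1;

-- Hypothetical table that clusters rows by start_time within a day bucket.
CREATE TABLE events_by_day (
    day DATE,
    start_time TIMESTAMP,
    event_id VARCHAR,
    number VARCHAR,
    PRIMARY KEY ((day), start_time, event_id, number)
) WITH CLUSTERING ORDER BY (start_time DESC, event_id ASC, number ASC);

-- Rows in a partition are already stored newest first, so the latest
-- event of a given day is simply the first row of that partition.
SELECT * FROM events_by_day WHERE day = '2025-01-11' LIMIT 1;
```

Note that the second layout only answers "latest event within one bucket"; the caller still has to know which bucket to query.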
