数据建模以促进修剪/批量更新/删除在Scylladb/cassandra中
可以说,我有一个类似于复合分区键的表。
CREATE TABLE heartrate (
pet_chip_id uuid,
date text,
time timestamp,
heart_rate int,
PRIMARY KEY ((pet_chip_id, date), time)
);
假设有一个批处理作业可以修剪所有比X年龄更古老的数据。我不能在查询中进行查询,因为它缺少查询中的其他分区密钥。
DELETE FROM heartrate WHERE date < '2020-01-01';
您如何建模数据以在Scylla中实现的方式进行建模?我知道内部scylla会根据分区键创建一个分区,但是在这种情况下,不可能查询所有pet_chip_id
的列表,并执行n个查询以删除。
只是想知道人们如何在RDBMS世界之外做到这一点。
Lets say I have a table like below with a composite partition key.
CREATE TABLE heartrate (
pet_chip_id uuid,
date text,
time timestamp,
heart_rate int,
PRIMARY KEY ((pet_chip_id, date), time)
);
Lets say there is a batch job to prune all the data older than X. I can't do below query since its missing other partition key in the query.
DELETE FROM heartrate WHERE date < '2020-01-01';
How do you model your data such a way that this can be achieved in Scylla? I understand that internally scylla creates a partition based on partition keys but in this case its impossible to query all the list of pet_chip_id
and do N queries to delete.
Just wanted to know how people do this outside RDBMS world.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在scylla中自动删除旧数据的推荐方法是使用时间实时(ttl)功能:
编写一行时,添加“使用TTL 864000”是您希望该数据在10天内自动删除。您还可以为给定表格指定默认的ttl,以便在(例如)10天之后写入表上的每片数据都会过期。
Scylla的TTL功能与数据本身分开,因此您用作分区键或聚类键的列无关紧要 - 特别是“日期”列不再需要是群集键(或根本存在,为此,物质) - 除非您还需要其他东西。
The recommended way to delete old data automatically in Scylla is using the Time-to-live (TTL) feature:
When you write a row, you add "USING TTL 864000" is you want that data to be deleted automatically in 10 days. You can also specify a default TTL for a given table, so that every piece of data written to the table will get expired after (say) 10 days.
Scylla's TTL feature is separate from the data itself, so it doesn't matter which columns you used as partition keys or clustering keys - in particular the "date" column no longer needs to be a clustering key (or exist at all, for that matter) - unless you also need it for something else.
正如 @nadav-Harel在他的回答中所说的那样,如果您可以定义一个始终是最好的解决方案的TTL您需要在删除查询中使用的字段。在修剪工作中,您可以首先从MV中进行选择,然后使用从MV获得的值从主表中删除。
示例:
然后,在您的修剪工作中,您可以基于
my_mv
从b
中选择,然后根据从选择QUERY返回的值从my_table
删除。请注意,此解决方案可能不是根据您的模型和所拥有的数据量有效的,但是请记住,删除数据也是查询数据的一种方法,并且应根据查询需求定义您的模型,即在定义之前定义数据您的模型,您需要考虑各种方式(包括如何修剪数据)。
As @nadav-harel said in his answer if you can define a TTL that's always the best solution but if you can't, a possible solution is to create a materialized view to be able to list the primary keys of the main table based on the field that you need to use in the delete query. In the prune job you can first do a select from the MV and then delete from the main table using the values that you got from the MV.
Example:
Then in your prune job you could select from
my_mv
based onb
and then delete frommy_table
based on the values returned from the select query.Note that this solution might not be effective depending on your model and the amount of data you have, but keep in mind that deleting data is also a way of querying your data and your model should be defined based on your queries needs, i.e. before defining your model, you need to think about every way you will query it (including how you will prune your data).