Querying a Cassandra column family for rows that have not been updated in X days
I'm moving an existing MySQL-based application over to Cassandra. So far, finding the equivalent Cassandra data model has been quite easy, but I've stumbled on the following problem, for which I'd appreciate some input:
Consider a MySQL table holding millions of entities:
CREATE TABLE entities (
id INT AUTO_INCREMENT NOT NULL,
entity_information VARCHAR(...),
entity_last_updated DATETIME,
PRIMARY KEY (id),
KEY (entity_last_updated)
);
Every five minutes the table is queried for entities that need to be updated:
SELECT id FROM entities
WHERE entity_last_updated IS NULL
OR entity_last_updated < DATE_ADD(NOW(), INTERVAL -7*24 HOUR)
ORDER BY entity_last_updated ASC;
The entities returned by this query are then updated using the following query:
UPDATE entities
SET entity_information = ?,
entity_last_updated = NOW()
WHERE id = ?;
What would be the corresponding Cassandra data model that would allow me to store the given information and effectively query the entities table for entities that need to be updated (that is: entities that have not been updated in the last seven days)?
Comments (2)
To achieve what you described, you need to use the timestamp as the column name and call the get slice function with a start time and an end time; it will give you all columns whose names fall within that range. Also sort by column name, so that the results come back ordered by time.
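As a rough illustration of that idea, here is a minimal in-memory sketch in Python. It does not talk to Cassandra; it only models a single wide row whose column names are timestamps and whose values are entity ids, kept sorted the way a column comparator would keep them. The class `LastUpdatedIndex` and its methods `touch` and `stale_before` are hypothetical names made up for this example. One thing the sketch makes visible is that every update has to delete the old index column before writing the new one, otherwise stale entries pile up.

```python
import bisect
from datetime import datetime, timedelta


class LastUpdatedIndex:
    """In-memory stand-in for one wide row whose column names are timestamps.

    Hypothetical helper for illustration only; not a Cassandra client.
    """

    def __init__(self):
        self._columns = []   # sorted list of (timestamp, entity_id) pairs
        self._current = {}   # entity_id -> timestamp currently in the index

    def touch(self, entity_id, when):
        """Record an update: drop the old index column, then write the new one."""
        old = self._current.get(entity_id)
        if old is not None:
            pos = bisect.bisect_left(self._columns, (old, entity_id))
            del self._columns[pos]
        bisect.insort(self._columns, (when, entity_id))
        self._current[entity_id] = when

    def stale_before(self, cutoff):
        """Slice from the start of the row up to, but not including, cutoff."""
        end = bisect.bisect_left(self._columns, (cutoff,))
        return [entity_id for _, entity_id in self._columns[:end]]


now = datetime(2024, 1, 10)
index = LastUpdatedIndex()
index.touch(1, now - timedelta(days=9))
index.touch(2, now - timedelta(days=1))
index.touch(3, now - timedelta(days=8))

# Entities not updated in the last seven days, oldest first.
print(index.stale_before(now - timedelta(days=7)))
```

Entities that have never been updated (the `IS NULL` case in the original query) are not covered here; one possible convention, again an assumption, would be to index them under a sentinel timestamp such as `datetime.min` so that they sort first.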
You'd have to scan all the rows and grab the timestamp from the column(s) you're interested in. If this is something you run every day or so, doing this in a Hadoop job should be fine. If it's something you run every few minutes, then you'll need to come up with another approach.
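For comparison, the full-scan approach boils down to the filter below. This is only a sketch in plain Python: `find_stale` is a hypothetical function, and `rows` stands in for whatever the scan (for example a Hadoop job) would yield, assumed here to be `(id, last_updated)` pairs with `None` for entities that were never updated. The cost is proportional to the total number of rows, which is why it suits a daily batch better than a five-minute poll.

```python
from datetime import datetime, timedelta


def find_stale(rows, now, max_age=timedelta(days=7)):
    """Return ids of rows never updated or last updated before now - max_age.

    rows: iterable of (entity_id, last_updated) pairs; last_updated may be None.
    Results are ordered oldest first, with never-updated entities at the front,
    mirroring ORDER BY entity_last_updated ASC in the MySQL query.
    """
    cutoff = now - max_age
    stale = [(ts, entity_id) for entity_id, ts in rows
             if ts is None or ts < cutoff]
    # None sorts first, then ascending timestamp.
    stale.sort(key=lambda pair: (pair[0] is not None, pair[0] or datetime.min))
    return [entity_id for _, entity_id in stale]


now = datetime(2024, 1, 10)
rows = [
    (1, None),
    (2, now - timedelta(days=8)),
    (3, now - timedelta(days=1)),
    (4, now - timedelta(days=10)),
]
print(find_stale(rows, now))
```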