
Approaches to implementing data versioning in Cassandra

Posted 2024-10-02 11:03:55


Can you share your thoughts on how you would implement data versioning in Cassandra?

Suppose that I need to version records in a simple address book. (Address book records are stored as rows in a ColumnFamily.)
I expect that the history:

- will be used infrequently
- will be used all at once, to present it in a "time machine" fashion
- will contain no more than a few hundred versions per record
- won't expire

I'm considering the following approaches:

1. Convert the address book to a Super Column Family and store multiple versions of each address book record in one row, as super columns keyed by timestamp.
2. Create a new Super Column Family to store old records or changes to the records. Such a structure would look as follows:

```
{
    'address book row key': {
        'time stamp1': {
            'first name': 'new name',
            'modified by': 'user id',
        },
        'time stamp2': {
            'first name': 'new name',
            'modified by': 'user id',
        },
    },
    'another address book row key': {
        'time stamp': {
            ....
```

3. Store versions as serialized (JSON) objects in a new ColumnFamily, representing sets of versions as rows and individual versions as columns (modelled after Simple Document Versioning with CouchDB).
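To make the third option more concrete, here is a minimal sketch in Python. It does not talk to Cassandra at all: a plain dict stands in for the new ColumnFamily, and the names `record_versions_cf`, `save_version`, and `load_history` are hypothetical, chosen only for illustration. It shows one row per record, one column per version, and the "read everything at once" access pattern described above.

```python
import json

# Hypothetical in-memory stand-in for the new ColumnFamily (an assumption,
# not a Cassandra client): row key -> {timestamp column name -> JSON string}.
record_versions_cf = {}

def save_version(record_key, timestamp_us, record):
    """Store one snapshot of a record as a new column in the record's row."""
    row = record_versions_cf.setdefault(record_key, {})
    row[timestamp_us] = json.dumps(record, sort_keys=True)

def load_history(record_key):
    """Return all snapshots of a record, oldest first ("time machine" read)."""
    row = record_versions_cf.get(record_key, {})
    return [(ts, json.loads(blob)) for ts, blob in sorted(row.items())]

# Example usage with made-up data.
save_version('record_42', 1290635938721704,
             {'first name': 'Ann', 'modified by': 'user_1'})
save_version('record_42', 1290636018401680,
             {'first name': 'Anna', 'modified by': 'user_2'})
history = load_history('record_42')
```

Because each version is an opaque JSON blob, a single row read returns the whole history, at the cost of not being able to query individual fields on the server side.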


眼前雾蒙蒙 2024-10-09 11:03:55

If you can add the assumption that address books typically have fewer than 10,000 entries, then using one row per address book timeline in a super column family would be a decent approach.

A row would look like:

{'address_book_18f3a8': {
    1290635938721704: {'entry1': 'entry1_stuff', 'entry2': 'entry2_stuff'},
    1290636018401680: {'entry1': 'entry1_stuff_v2', ...},
    ...
}}

where the row key identifies the address book, each super column name is a time stamp, and the subcolumns represent the address book's contents for that version.

This would allow you to read the latest version of an address book with only one query and also write a new version with a single insert.

The reason I suggest this only if address books have fewer than 10,000 elements is that a super column must be completely deserialized even when you read a single subcolumn. Overall, that is not too bad in this case, but it's something to keep in mind.
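As a rough illustration of this layout, the sketch below models the super column family with nested Python dicts. It is an in-memory approximation only, and `timeline_scf`, `write_version`, and `read_latest` are hypothetical names, not part of any Cassandra API. Each write adds one super column holding a full copy of the address book, and reading the latest version touches a single row.

```python
# Hypothetical in-memory stand-in for the super column family (an assumption):
# row key (address book) -> {super column name (timestamp) -> {entry: value}}.
timeline_scf = {}

def write_version(book_key, timestamp_us, entries):
    """Single 'insert': store a full copy of the book as a new super column."""
    timeline_scf.setdefault(book_key, {})[timestamp_us] = dict(entries)

def read_latest(book_key):
    """Single 'query': return (timestamp, entries) of the newest super column."""
    row = timeline_scf[book_key]
    latest = max(row)  # timestamps are integers, so the largest is the newest
    return latest, row[latest]

# Example usage with the values from the row shown above.
write_version('address_book_18f3a8', 1290635938721704,
              {'entry1': 'entry1_stuff', 'entry2': 'entry2_stuff'})
write_version('address_book_18f3a8', 1290636018401680,
              {'entry1': 'entry1_stuff_v2'})
latest_ts, latest_entries = read_latest('address_book_18f3a8')
```

Note that `read_latest` always returns the whole version, which mirrors the point about super columns being deserialized in full.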

An alternative approach would be to use a single row per version of the address book, plus a separate CF holding one timeline row per address book, like:

{'address_book_18f3a8': {1290635938721704: some_uuid1, 1290636018401680: some_uuid2, ...}}

Here, some_uuid1 and some_uuid2 correspond to the row key for those versions of the address book. The downside to this approach is that it requires two queries every time the address book is read. The upside is that it lets you efficiently read only select parts of an address book.
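The two-step read can be sketched as follows. Again this is an in-memory model under stated assumptions, not Cassandra client code: `timeline_cf`, `book_versions_cf`, `write_book_version`, and `read_latest_entries` are hypothetical names, and random UUIDs stand in for the version row keys.

```python
import uuid

# Hypothetical in-memory stand-ins for the two column families (assumptions).
timeline_cf = {}       # address book key -> {timestamp: version row key}
book_versions_cf = {}  # version row key -> {entry name: entry value}

def write_book_version(book_key, timestamp_us, entries):
    """Store the version in its own row, then record it in the timeline row."""
    version_key = str(uuid.uuid4())
    book_versions_cf[version_key] = dict(entries)
    timeline_cf.setdefault(book_key, {})[timestamp_us] = version_key
    return version_key

def read_latest_entries(book_key, wanted=None):
    """Two lookups: timeline row first, then only the requested entries."""
    row = timeline_cf[book_key]
    version_key = row[max(row)]               # lookup 1: newest version key
    version = book_versions_cf[version_key]   # lookup 2: the version row
    if wanted is None:
        return dict(version)
    return {name: version[name] for name in wanted if name in version}

# Example usage with made-up entries.
write_book_version('address_book_18f3a8', 1290635938721704,
                   {'entry1': 'entry1_stuff', 'entry2': 'entry2_stuff'})
write_book_version('address_book_18f3a8', 1290636018401680,
                   {'entry1': 'entry1_stuff_v2', 'entry2': 'entry2_stuff'})
partial = read_latest_entries('address_book_18f3a8', wanted=['entry1'])
```

The `wanted` parameter reflects the upside mentioned above: since each version is an ordinary row, a read can ask for just a few columns instead of the whole address book.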
