我希望为一个相当复杂的关系数据库实现审计跟踪,该数据库的架构很容易发生变化。我正在考虑的一种途径是使用 DVCS 来跟踪更改。
(我能想象到的好处是:无模式历史记录,整个系统状态的快照,用于分析、回放和迁移的标准工具,高效存储,独立系统,保持数据库清洁。数据库写入量不大,历史记录也不重)不是核心功能,更多的是为了进行审计跟踪哦,我喜欢尝试疯狂的新方法来解决问题。)
我不是这些系统的专家(我只熟悉基本的 git),所以我不确定实施起来会有多么困难。我正在考虑采用 Mercurial 的方法,但可能将文件内容/清单/变更集存储在键值数据存储中,而不是使用实际文件。
数据行将被序列化为 json,每个“文件”可以是一行。或者,整个表可以存储在一个“文件”中,每行驻留在等于其主键的行号上(假设表不是太大,我预计所有表的行数都少于 4000 行左右。这可能意味着可以自动生成变更集,而无需查阅表“文件”的其余部分
(但我对此表示怀疑,因为我认为我们需要该表的 SHA-1 哈希值 。这些文件可能是。按可预测的行数分割,例如文件 1 中的 0 ,文件 2 中的 1000 等它们很小)
有没有熟悉 DVCS 内部结构或一般数据结构的人能够对这样的方法发表评论,它是如何工作的,甚至应该这样做吗? 我想这样
的系统有两个方面:1) 将 SQL 数据映射到 DVCS 系统,2) 将 DVCS 数据存储在键/值数据存储(而不是文件)中以提高效率。
(注意 json 序列化位已被我的 ORM 覆盖)
I'm looking to implement an audit trail for a reasonably complicated relational database, whose schema is prone to change. One avenue I'm thinking of is using a DVCS to track changes.
(The benefits I can imagine are: schemaless history, snapshots of entire system's state, standard tools for analysis, playback and migration, efficient storage, separate system, keeping DB clean. The database is not write-heavy and history is not not a core feature, it's more for the sake of having an audit trail. Oh and I like trying crazy new approaches to problems.)
I'm not an expert with these systems (I only have basic git familiarity), so I'm not sure how difficult it would be to implement. I'm thinking of taking mercurial's approach, but possibly storing the file contents/manifests/changesets in a key-value data store, not using actual files.
Data rows would be serialised to json, each "file" could be an row. Alternatively an entire table could be stored in a "file", with each row residing on the line number equal to its primary key (assuming the tables aren't too big, I'm expecting all to have less than 4000 or so rows. This might mean that the changesets could be automatically generated, without consulting the rest of the table "file".
(But I doubt it, because I think we need a SHA-1 hash of the whole file. The files could perhaps be split up by a predictable number of lines, eg 0 < primary key < 1000
in file 1, 1000 < primary key < 2000
in file 2 etc, keeping them smallish)
Is there anyone familiar with the internals of DVCS' or data structures in general who might be able to comment on an approach like this? How could it be made to work, and should it even be done at all?
I guess there are two aspects to a system like this: 1) mapping SQL data to a DVCS system and 2) storing the DVCS data in a key/value data store (not files) for efficiency.
(NB the json serialisation bit is covered by my ORM)
发布评论
评论(3)
我自己对此进行了一些研究,这里有一些评论可供分享。
尽管我原以为使用 python 中的 Mercurial 会让事情变得更容易,但 DVCS 的许多功能并不是必需的(尤其是分支、合并)。我认为简单地窃取一些设计决策并实现一个满足我的需求的基本系统会更容易。所以,这就是我的想法。
Blob
系统对要存档的记录进行 json 表示,并生成该记录的 SHA-1 哈希值(如果愿意,可以称为“节点 ID”)。该散列表示该记录在给定时间点的状态,与 git 的“blob”相同。
变更集
变更被分组到变更集中。变更集记录一些元数据(时间戳、提交者等)并链接到任何父变更集和当前“树”。
树
我没有使用 Mercurial 的“清单”方法,而是使用 git 的“树”结构。树只是 blob(模型实例)或其他树的列表。在顶层,每个数据库表都有自己的树。下一个级别可以是所有记录。如果有很多记录(经常有),可以将它们分成子树。
这样做意味着如果您只更改一条记录,则可以保留未更改的树。它还允许每个记录拥有自己的 blob,这使得管理变得更加容易。
存储
我喜欢 Mercurial 的修订日志想法,因为它允许您最大限度地减少数据存储(仅存储变更集),同时保持快速检索(所有变更集都在相同的数据结构中)。这是在每条记录的基础上完成的。
我认为像 MongoDB 这样的系统最适合存储数据(它必须是键值对的,而且我认为 Redis 过于专注于将所有内容保存在内存中,这对于存档来说并不重要)。它将存储变更集、树和修订日志。当前 HEAD 等的一些额外键,系统就完成了。
因为我们使用的是树,所以我们可能不需要将外键显式链接到它所引用的确切“blob”。只需使用主键就足够了。我希望!
用例:1. 归档更改
一旦发生更改,记录的当前状态就会序列化为 json,并为其状态生成哈希值。这是针对所有其他相关更改完成的,并将其打包到更改集中。完成后,相关的修订日志将被更新,使用新对象(“blob”)哈希值生成新的树和子树,并使用元信息“提交”变更集。
用例 2. 检索旧状态
找到相关变更集(MongoDB 搜索?)后,将遍历树,直到找到我们要查找的 blob ID。我们转到修订日志并检索记录的状态或使用可用的快照和变更集生成它。然后,用户必须决定是否也需要检索外键,但这很容易(使用我们开始时使用的相同变更集)。
摘要
这些操作都不应该太昂贵,并且我们对数据库的所有更改都有一个节省空间的描述。存档与生产数据库分开保存,使其能够完成自己的任务,并允许随着时间的推移对数据库模式进行更改。
I've looked into this a little on my own, and here are some comments to share.
Although I had thought using mercurial from python would make things easier, there's a lot of functionality that the DVCS's have that aren't necessary (esp branching, merging). I think it would be easier to simply steal some design decisions and implement a basic system for my needs. So, here's what I came up with.
Blobs
The system makes a json representation of the record to be archived, and generates a SHA-1 hash of this (a "node ID" if you will). This hash represents the state of that record at a given point in time and is the same as git's "blob".
Changesets
Changes are grouped into changesets. A changeset takes note of some metadata (timestamp, committer, etc) and links to any parent changesets and the current "tree".
Trees
Instead of using Mercurial's "Manifest" approach, I've gone for git's "tree" structure. A tree is simply a list of blobs (model instances) or other trees. At the top level, each database table gets its own tree. The next level can then be all the records. If there are lots of records (there often are), they can be split up into subtrees.
Doing this means that if you only change one record, you can leave the untouched trees alone. It also allows each record to have its own blob, which makes things much easier to manage.
Storage
I like Mercurial's revlog idea, because it allows you to minimise the data storage (storing only changesets) and at the same time keep retrieval quick (all changesets are in the same data structure). This is done on a per record basis.
I think a system like MongoDB would be best for storing the data (It has to be key-value, and I think Redis is too focused on keeping everything in memory, which is not important for an archive). It would store changesets, trees and revlogs. A few extra keys for the current HEAD etc and the system is complete.
Because we're using trees, we probably don't need to explicitly link foreign keys to the exact "blob" it's referring to. Justing using the primary key should be enough. I hope!
Use case: 1. Archiving a change
As soon as a change is made, the current state of the record is serialised to json and a hash is generated for its state. This is done for all other related changes and packaged into a changeset. When complete, the relevant revlogs are updated, new trees and subtrees are generated with the new object ("blob") hashes and the changeset is "committed" with meta information.
Use case 2. Retrieving an old state
After finding the relevant changeset (MongoDB search?), the tree is then traversed until we find the blob ID we're looking for. We go to the revlog and retrieve the record's state or generate it using the available snapshots and changesets. The user will then have to decide if the foreign keys need to be retrieved too, but doing that will be easy (using the same changeset we started with).
Summary
None of these operations should be too expensive, and we have a space efficient description of all changes to a database. The archive is kept separately to the production database allowing it to do its thing and allowing changes to the database schema to take place over time.
如果数据库的写入量不是很大(正如您所说),为什么不以实现您目标的方式实现实际的数据库表呢?例如,添加“版本”列。然后永远不要更新或删除行,除了这个特殊列,您可以将其设置为 NULL 表示“当前”,1 表示“最古老的已知”,然后从那里向上移动。当您想要更新一行时,将其版本设置为下一个更高版本,然后插入一个没有版本的新行。然后,当您查询时,只需选择具有空版本的行即可。
If the database is not write-heavy (as you say), why not just implement the actual database tables in a way that achieves your goal? For example, add a "version" column. Then never update or delete rows, except for this special column, which you can set to NULL to mean "current," 1 to mean "the oldest known", and go up from there. When you want to update a row, set its version to the next higher one, and insert a new one with no version. Then when you query, just select rows with the empty version.
查看 cqrs 和 Greg Young 的事件源。我还有一篇关于元事件工作的博客文章,它指出了业务事件之河中的架构变化。
http://adventuresinagile.blogspot.com/2009/09 /rewind-button-for-your-application.html
如果您浏览我的博客,您还会发现版本脚本方案,并且您可以通过源代码控制这些方案。
Take a look at cqrs and Greg Young's event sourcing. I also have a blog post about working in meta events that pin point schema changes within the river of business events.
http://adventuresinagile.blogspot.com/2009/09/rewind-button-for-your-application.html
If you look through my blog, you'll also find version script schemes and you can source code control those.