Using a DVCS for an RDBMS audit trail

Posted 2024-11-16 03:13:52

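I'm looking to implement an audit trail for a reasonably complex relational database whose schema is prone to change. One avenue I'm considering is using a DVCS to track changes.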

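(The benefits I can imagine are: schemaless history, snapshots of the entire system's state, standard tools for analysis, playback and migration, efficient storage, a separate system, and keeping the database clean. The database is not write-heavy, and history is not a core feature; it's more for the sake of having an audit trail. Oh, and I like trying crazy new approaches to problems.)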

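I'm not an expert with these systems (I only have basic git familiarity), so I'm not sure how difficult this would be to implement. I'm thinking of taking Mercurial's approach, but possibly storing the file contents, manifests and changesets in a key-value data store rather than in actual files.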

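Data rows would be serialised to JSON, and each "file" could be one row. Alternatively, an entire table could be stored in one "file", with each row residing on the line number equal to its primary key (assuming the tables aren't too big; I expect all of them to have fewer than 4000 or so rows). This might mean that changesets could be generated automatically without consulting the rest of the table "file". (But I doubt it, because I think we would need the SHA-1 hash of the whole table. The files could be split up by a predictable number of rows, e.g. starting at row 0 in file 1, at row 1000 in file 2, and so on, which keeps them small.)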
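The table-as-file layout described above can be sketched roughly as follows. This is only an illustration under assumptions: `ROWS_PER_FILE`, `locate` and `file_hashes` are hypothetical names, primary keys are assumed to be non-negative integers, and a chunk size of 1000 is taken from the example in the question.

```python
import hashlib
import json

ROWS_PER_FILE = 1000  # assumed chunk size, taken from the example in the question


def locate(pk: int) -> tuple:
    """Return (file number, line number inside that file) for a primary key."""
    return pk // ROWS_PER_FILE + 1, pk % ROWS_PER_FILE


def file_hashes(rows: dict) -> dict:
    """Lay rows out one per line and return the SHA-1 of each chunk file."""
    files = {}
    for pk, record in rows.items():
        file_no, line_no = locate(pk)
        # Every file has a fixed number of lines; unused lines stay empty.
        lines = files.setdefault(file_no, [""] * ROWS_PER_FILE)
        lines[line_no] = json.dumps(record, sort_keys=True)
    return {
        file_no: hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()
        for file_no, lines in files.items()
    }


before = file_hashes({5: {"name": "Ada"}, 1500: {"name": "Bob"}})
after = file_hashes({5: {"name": "Ada"}, 1500: {"name": "Robert"}})
```

Changing row 1500 alters only the hash of file 2, which is the point of splitting by a predictable row count: the hash of a chunk still needs the whole chunk, but not the whole table.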

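Is there anyone familiar with the internals of DVCSs, or with data structures in general, who could comment on an approach like this? How could it be made to work, and should it even be done at all?

I guess there are two aspects to a system like this: 1) mapping SQL data to a DVCS, and 2) storing the DVCS data in a key/value data store (rather than files) for efficiency.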

(NB: the JSON serialisation bit is already covered by my ORM.)

醉生梦死 2024-11-23 03:13:52

I've looked into this a little on my own, and here are some comments to share.

Although I had thought using Mercurial from Python would make things easier, DVCSs have a lot of functionality that isn't necessary here (especially branching and merging). I think it would be easier to simply steal some design decisions and implement a basic system for my needs. So, here's what I came up with.

Blobs

The system makes a JSON representation of the record to be archived, and generates a SHA-1 hash of this (a "node ID" if you will). This hash represents the state of that record at a given point in time and plays the same role as git's "blob".
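A minimal sketch of that hashing step might look like this. The function name `blob_id` is made up for illustration, and the choice of sorted keys and compact separators is an assumption: some canonical form is needed so that equal records always produce the same hash.

```python
import hashlib
import json


def blob_id(record: dict) -> str:
    """Return the SHA-1 hex digest of a canonical JSON form of `record`."""
    # sort_keys and fixed separators keep the serialisation stable, so two
    # equal records hash to the same ID regardless of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


same_a = blob_id({"id": 1, "name": "Ada"})
same_b = blob_id({"name": "Ada", "id": 1})
different = blob_id({"id": 1, "name": "Grace"})
```

Here `same_a` and `same_b` are equal, while `different` is not, so the ID identifies a record state rather than a particular serialisation of it.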

Changesets

Changes are grouped into changesets. A changeset takes note of some metadata (timestamp, committer, etc) and links to any parent changesets and the current "tree".
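As a rough illustration, a changeset could be modelled like this. The class and field names are assumptions, not something taken from git or Mercurial, and hashing the changeset's own JSON to get its ID is one possible choice.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Changeset:
    tree: str          # ID of the root tree after this change
    parents: tuple     # IDs of parent changesets (empty for the first one)
    committer: str
    timestamp: float

    def node_id(self) -> str:
        """Hash the metadata and links so the ID covers the whole history."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# "tree-0" and "tree-1" are placeholder tree IDs for the example.
first = Changeset(tree="tree-0", parents=(), committer="alice", timestamp=0.0)
second = Changeset(
    tree="tree-1", parents=(first.node_id(),), committer="alice", timestamp=1.0
)
```

Because the parent ID is part of what gets hashed, altering an old changeset would change the ID of every later one, which is a useful property for an audit trail.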

Trees

Instead of using Mercurial's "Manifest" approach, I've gone for git's "tree" structure. A tree is simply a list of blobs (model instances) or other trees. At the top level, each database table gets its own tree. The next level can then be all the records. If there are lots of records (there often are), they can be split up into subtrees.

Doing this means that if you only change one record, you can leave the untouched trees alone. It also allows each record to have its own blob, which makes things much easier to manage.
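The sharing of untouched trees can be sketched as below. This assumes a plain dict as the object store and a two-level layout (root tree, then one tree per table); `obj_id` and `build_root` are hypothetical helpers, and splitting large tables into further subtrees is left out.

```python
import hashlib
import json


def obj_id(obj) -> str:
    """Content address: SHA-1 of the object's canonical JSON."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def build_root(tables: dict, store: dict) -> str:
    """`tables` maps table name -> {primary key: record}. Returns the root tree ID."""
    root = {}
    for name, rows in tables.items():
        table_tree = {}
        for pk, record in rows.items():
            blob = obj_id(record)
            store[blob] = record            # one blob per record
            table_tree[str(pk)] = blob
        tree = obj_id(table_tree)
        store[tree] = table_tree            # one tree per table
        root[name] = tree
    root_id = obj_id(root)
    store[root_id] = root
    return root_id


store = {}
root1 = build_root(
    {"users": {1: {"name": "Ada"}, 2: {"name": "Bob"}}, "orders": {1: {"total": 10}}},
    store,
)
root2 = build_root(
    {"users": {1: {"name": "Ada"}, 2: {"name": "Robert"}}, "orders": {1: {"total": 10}}},
    store,
)
```

After the second snapshot, both roots point at the same `orders` tree ID; only the `users` tree and the root itself are new objects.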

Storage

I like Mercurial's revlog idea, because it allows you to minimise data storage (storing only the changes between revisions, plus occasional full snapshots) and at the same time keep retrieval quick (all revisions live in the same data structure). This is done on a per-record basis.
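A much simplified, revlog-inspired log for a single record might look like the sketch below. It is not Mercurial's actual format: `RecordLog`, `SNAPSHOT_EVERY` and the delta layout are assumptions, and records are treated as flat dicts.

```python
SNAPSHOT_EVERY = 4      # assumed interval between full snapshots
_MISSING = object()     # sentinel for "key not present in the previous state"


class RecordLog:
    """Per-record history stored as full snapshots plus field-level deltas."""

    def __init__(self):
        self.entries = []  # ("snapshot", state) or ("delta", {"set": ..., "del": ...})

    def append(self, state: dict) -> int:
        """Add a new revision and return its revision number."""
        if len(self.entries) % SNAPSHOT_EVERY == 0:
            self.entries.append(("snapshot", dict(state)))
        else:
            prev = self.get(len(self.entries) - 1)
            changed = {k: v for k, v in state.items() if prev.get(k, _MISSING) != v}
            removed = [k for k in prev if k not in state]
            self.entries.append(("delta", {"set": changed, "del": removed}))
        return len(self.entries) - 1

    def get(self, rev: int) -> dict:
        """Rebuild revision `rev` from the nearest earlier snapshot."""
        base = rev
        while self.entries[base][0] != "snapshot":
            base -= 1
        state = dict(self.entries[base][1])
        for _kind, delta in self.entries[base + 1 : rev + 1]:
            state.update(delta["set"])
            for key in delta["del"]:
                state.pop(key, None)
        return state


log = RecordLog()
r0 = log.append({"id": 1, "name": "Ada", "email": "a@example.com"})
r1 = log.append({"id": 1, "name": "Ada L", "email": "a@example.com"})
r2 = log.append({"id": 1, "name": "Ada L"})
```

The periodic snapshots bound how many deltas have to be replayed to rebuild any revision, which is the trade-off between storage and retrieval speed mentioned above.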

I think a system like MongoDB would be best for storing the data (it has to be key-value, and I think Redis is too focused on keeping everything in memory, which is not important for an archive). It would store changesets, trees and revlogs. Add a few extra keys for the current HEAD and so on, and the system is complete.

Because we're using trees, we probably don't need to explicitly link foreign keys to the exact "blob" they refer to. Just using the primary key should be enough. I hope!

Use case 1: Archiving a change

As soon as a change is made, the current state of the record is serialised to JSON and a hash is generated for its state. The same is done for all other related changes, and the results are packaged into a changeset. When complete, the relevant revlogs are updated, new trees and subtrees are generated with the new object ("blob") hashes, and the changeset is "committed" with its meta information.
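The commit flow could be sketched like this, assuming a plain dict stands in for the key-value store and a `HEAD` key points at the latest changeset. `put` and `commit` are hypothetical helpers, and for brevity full blobs are stored instead of revlog deltas.

```python
import hashlib
import json


def put(store: dict, obj) -> str:
    """Store `obj` under the SHA-1 of its canonical JSON and return that ID."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    oid = hashlib.sha1(data.encode("utf-8")).hexdigest()
    store[oid] = obj
    return oid


def commit(store: dict, changes: dict, committer: str, timestamp: float) -> str:
    """`changes` maps (table, pk) -> new record state. Returns the changeset ID."""
    head = store.get("HEAD")
    # Start from the previous root tree so untouched tables keep their tree IDs.
    root = dict(store[store[head]["tree"]]) if head else {}
    touched = {}
    for (table, pk), record in changes.items():
        if table not in touched:
            touched[table] = dict(store[root[table]]) if table in root else {}
        touched[table][str(pk)] = put(store, record)
    for table, tree in touched.items():
        root[table] = put(store, tree)
    changeset = {
        "tree": put(store, root),
        "parents": [head] if head else [],
        "committer": committer,
        "timestamp": timestamp,
    }
    store["HEAD"] = put(store, changeset)
    return store["HEAD"]


store = {}
c1 = commit(store, {("users", 1): {"name": "Ada"}, ("orders", 7): {"user_id": 1}}, "alice", 1.0)
c2 = commit(store, {("users", 1): {"name": "Ada L"}}, "alice", 2.0)
```

The second commit only rewrites the `users` tree and the root; the `orders` tree ID is carried over unchanged from the first commit.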

Use case 2: Retrieving an old state

After finding the relevant changeset (MongoDB search?), the tree is then traversed until we find the blob ID we're looking for. We go to the revlog and retrieve the record's state or generate it using the available snapshots and changesets. The user will then have to decide if the foreign keys need to be retrieved too, but doing that will be easy (using the same changeset we started with).
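A sketch of the lookup, again with a dict standing in for the store. The short IDs such as `"c1"` and `"b1"` are placeholders for real hashes, blobs are stored in full rather than in a revlog, and `history`, `changeset_as_of` and `record_at` are hypothetical helper names.

```python
def history(store: dict, changeset_id):
    """Yield changeset IDs from `changeset_id` back to the first, via first parents."""
    while changeset_id:
        yield changeset_id
        parents = store[changeset_id]["parents"]
        changeset_id = parents[0] if parents else None


def changeset_as_of(store: dict, head: str, when: float):
    """Return the newest changeset committed at or before `when`, or None."""
    for cid in history(store, head):
        if store[cid]["timestamp"] <= when:
            return cid
    return None


def record_at(store: dict, changeset_id: str, table: str, pk) -> dict:
    """Walk changeset -> root tree -> table tree -> blob."""
    root = store[store[changeset_id]["tree"]]
    table_tree = store[root[table]]
    return store[table_tree[str(pk)]]


store = {
    "b1": {"name": "Ada"},
    "b2": {"name": "Ada L"},
    "u1": {"1": "b1"},
    "u2": {"1": "b2"},
    "r1": {"users": "u1"},
    "r2": {"users": "u2"},
    "c1": {"tree": "r1", "parents": [], "timestamp": 1.0},
    "c2": {"tree": "r2", "parents": ["c1"], "timestamp": 2.0},
}
old = record_at(store, changeset_as_of(store, "c2", 1.5), "users", 1)
```

Related records reached through foreign keys can be fetched with the same `record_at` call and the same changeset ID, which is why primary keys are enough as links.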

Summary

None of these operations should be too expensive, and we have a space-efficient description of all changes to the database. The archive is kept separate from the production database, which lets the production database do its thing and lets its schema change over time.

瑾兮 2024-11-23 03:13:52

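If the database is not write-heavy (as you say), why not implement the actual database tables in a way that achieves your goal? For example, add a "version" column. Then never update or delete rows, except for this special column, which you set to NULL to mean "current" and 1 to mean "oldest known", counting up from there. When you want to update a row, set its version to the next higher number, then insert a new row with no version. When you query, just select the rows whose version is NULL.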
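A small sketch of the version-column idea using Python's built-in `sqlite3` module. The `users` table, its columns and the `update_user` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, version INTEGER)")


def update_user(conn, user_id: int, name: str) -> None:
    """Retire the current row (version NULL) and insert the new state as current."""
    with conn:  # one transaction, so readers never see zero or two current rows
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        conn.execute(
            "UPDATE users SET version = ? WHERE id = ? AND version IS NULL",
            (latest + 1, user_id),
        )
        conn.execute(
            "INSERT INTO users (id, name, version) VALUES (?, ?, NULL)",
            (user_id, name),
        )


update_user(conn, 1, "Ada")
update_user(conn, 1, "Ada L")
update_user(conn, 1, "Ada Lovelace")

current = conn.execute(
    "SELECT name FROM users WHERE id = 1 AND version IS NULL"
).fetchall()
past = conn.execute(
    "SELECT version, name FROM users WHERE id = 1 AND version IS NOT NULL ORDER BY version"
).fetchall()
```

One consequence of this design is that the primary key alone is no longer unique in the table, so uniqueness constraints and foreign keys would have to take the version column into account.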