Database content version control
I am interested in keeping a running history of every change which has happened on some tables in my database, thus being able to reconstruct historical states of the database for analysis purposes.
I am using Postgres, and this MVCC thing seems like something I should be able to exploit for this purpose, but I cannot find any documentation to support that. Can I do it? Is there a better way?
Any input is appreciated!
UPD
I have marked Denis' response as the answer, because he did in fact answer whether MVCC is what I want which was the question. However, the strategy I have settled on is detailed below in case anyone finds it useful:
The Postgres feature that does what I want: online backup/point in time recovery.
http://www.postgresql.org/docs/8.1/static/backup-online.html explains how to use this feature but essentially you can set this "write ahead log" to archive mode, take a snapshot of the database (say, before it goes live), then continually archive the WAL. You can then use log replay to recall the state of the database at any time, with the side benefit of having a warm standby if you choose (by continually replaying the new WALs on your standby server).
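As a rough sketch, the archiving setup described above comes down to a couple of postgresql.conf settings (the archive directory here is illustrative; note that the archive_mode switch only appeared in 8.3 — on 8.1, setting archive_command by itself enables archiving):

```
# postgresql.conf -- illustrative WAL-archiving settings
# archive_mode exists from 8.3 onward; on 8.1 archive_command alone enables archiving
archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
```

After that, take the base backup (pg_start_backup()/pg_stop_backup() around a filesystem copy) and let the archived WAL segments accumulate from there.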
Perhaps this method is not as elegant as other ways of keeping a history, since you need to actually build the database for every point in time you wish to query, however it looks extremely easy to set up and loses zero information. That means when I have the time to improve my handling of historical data, I'll have everything and will therefore be able to transform my clunky system to a more elegant system.
One key fact that makes this so perfect is that my "valid time" is the same as my "transaction time" for the specific application; if this were not the case I would only be capturing "transaction time".
Before I found out about the WAL, I was considering just taking daily snapshots or something but the large size requirement and data loss involved did not sit well with me.
For a quick way to get up and running without compromising my data retention from the outset, this seems like the perfect solution.
Comments (3)
Time Travel
PostgreSQL used to have just this feature, and called it "Time Travel". See the old documentation.
There's somewhat similar functionality in the spi contrib module (its timetravel example) that you might want to check out.
Composite type audit trigger
What I usually do instead is to use triggers to log changes along with timestamps to archival tables, and query against those. If the table structure isn't going to change you can use something like:
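A minimal sketch of such an archival setup, assuming a table named sometable (all names here are illustrative, not the answer's exact code; the history table holds the whole row as a composite value of the table's own row type):

```sql
-- History table: the operation, the full row as a composite value, and a timestamp.
CREATE TABLE sometable_history (
    operation   text        NOT NULL,  -- TG_OP: 'INSERT', 'UPDATE' or 'DELETE'
    row_data    sometable,             -- whole row; for DELETE, the old row
    changed_at  timestamptz NOT NULL
);

CREATE OR REPLACE FUNCTION sometable_versioning() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        -- NEW is not defined for DELETE, so archive OLD instead
        INSERT INTO sometable_history VALUES (TG_OP, OLD, current_timestamp);
        RETURN OLD;
    ELSE
        INSERT INTO sometable_history VALUES (TG_OP, NEW, current_timestamp);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sometable_versioning_trg
AFTER INSERT OR UPDATE OR DELETE ON sometable
FOR EACH ROW EXECUTE PROCEDURE sometable_versioning();
```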
and your versioning trigger can just

insert into sometable_history(TG_OP, NEW, current_timestamp)

(with a different CASE for DELETE, where NEW is not defined).

hstore audit trigger
That gets painful if the schema changes to add new NOT NULL columns, though. If you expect to do anything like that, consider using an hstore to archive the columns instead of a composite type. I've added an implementation of that on the PostgreSQL wiki.

PITR
If you want to avoid impact on your master database (growing tables, etc), you can alternately use continuous archiving and point-in-time recovery to log WAL files that can, using a recovery.conf, be replayed to any moment in time. Note that WAL files are big and include not only the tuples you changed, but VACUUM activity and other details. You'll want to run them through clearxlogtail, since they can have garbage data on the end if they're partial segments from an archive timeout, and then you'll want to compress them heavily for long-term storage.
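For illustration, a minimal recovery.conf for replaying to a chosen moment might look like this (pre-PostgreSQL 12 style; the archive path and target timestamp are placeholders):

```
# recovery.conf -- illustrative PITR settings
restore_command      = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2012-06-01 12:00:00'
```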
Not really. There are tools to see dead rows, but auto-vacuuming means those will eventually be reclaimed.
If I get your question right, you're looking into logging slowly changing dimensions.
You might find this recent related thread interesting:
Temporal database design, with a twist (live vs draft rows)
I'm not aware of any tools/products that are built for that purpose.
While this may not be exactly what you're asking for, you can configure Postgresql to log ddl changes. Setting the log_line_prefix parameter (try including %d, %m, and %u) and setting the log_statement parameter to ddl should give you a reasonable history of who made what ddl changes and when.
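For illustration, those two settings might look like this in postgresql.conf (the prefix format is just one reasonable choice using the escapes mentioned above):

```
# postgresql.conf -- illustrative DDL-logging settings
log_statement   = 'ddl'
log_line_prefix = '%m %u %d '   # timestamp with ms, user name, database name
```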
Having said that, I don't believe logging ddl to be foolproof. For example, consider a situation where:
Another option might be to log ddl as above but then have a watcher program perform a pg_dump of the database schema whenever a ddl entry gets logged. You could even compare the new dump with the previous dump and extract just the objects that were changed.