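What is the most efficient way to store multiple revisions of a binary file in PostgreSQL?
I'm looking for a limited form of version control in the database here:
- Many revisions of the same file should take up as little space as possible (I'm not looking for compression, since the data is already compressed).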
- Size is of greatest importance; other requirements are secondary.
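- I should be able to fetch the current revision of a document as fast as possible (fetching older revisions is not time-critical).
Basically, an answer should include at least two things:
- What binary diff algorithm would you use?
- How would you structure this system in a PostgreSQL-specific way?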
Comments (2)
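"Size is of greatest importance": how about an external diff tool (such as bsdiff), called through PL/sh (http://plsh.projects.postgresql.org/), for example?
"I should be able to fetch the current revision of the document as fast as possible": in that case you will want to do your diffs the 'wrong' way round, so that the current revision is always stored in full and each older revision is stored only as a diff against the revision that replaced it.
To get back to an old revision, you would then iteratively apply the previous diffs as patches until you reach the revision you need.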
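The reverse-diff scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ReverseDeltaStore` and the helpers `make_delta` and `apply_delta` are hypothetical names, the store lives in memory rather than in PostgreSQL tables, and `difflib` stands in for a real binary diff tool such as bsdiff.

```python
from difflib import SequenceMatcher


def make_delta(source: bytes, target: bytes) -> list:
    """Describe how to rebuild `target` from `source` (a stand-in for bsdiff)."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse bytes already in source
        elif j2 > j1:
            ops.append(("data", target[j1:j2]))   # literal bytes only in target
    return ops


def apply_delta(source: bytes, delta: list) -> bytes:
    """Rebuild the target revision from `source` and a delta from make_delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += source[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class ReverseDeltaStore:
    """Hypothetical in-memory model: newest revision in full, older ones as diffs."""

    def __init__(self):
        self.head = None   # current revision, stored in full
        self.deltas = []   # deltas[k] rebuilds revision k from revision k + 1

    def commit(self, data: bytes) -> int:
        if self.head is not None:
            # Diff the 'wrong' way round: from the new revision back to the old one.
            self.deltas.append(make_delta(data, self.head))
        self.head = data
        return len(self.deltas)   # revision number of the new head

    def current(self) -> bytes:
        return self.head          # no patching needed for the current revision

    def get(self, revision: int) -> bytes:
        if self.head is None or not 0 <= revision <= len(self.deltas):
            raise IndexError("no such revision")
        data = self.head
        # Walk backwards from the head, applying one reverse diff per step.
        for k in range(len(self.deltas) - 1, revision - 1, -1):
            data = apply_delta(data, self.deltas[k])
        return data
```

Fetching the current revision is a plain read, while fetching revision k costs one patch per revision between k and the head, which matches the stated priorities. In a database, the head would presumably sit in a bytea column and the deltas in a separate table keyed by revision number; that mapping is an assumption, not something the answer spells out.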
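Whatever you do, I think you will need to uncompress the data before running the diff tool. Here's why: in a quick test, gz.diff (the diff between two gzipped files) came out larger than diff (the diff between the same two files uncompressed). If you try this with real files, I expect the difference to be even larger.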
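A rough way to see this effect for yourself is sketched below. It is not the original test: the helper `changed_span` is a hypothetical and deliberately crude measure (it only strips the shared prefix and suffix), the input is synthetic, and `zlib` stands in for gzip since both use DEFLATE. A real tool such as bsdiff would do better than this measure, but it faces the same underlying problem.

```python
import random
import zlib


def changed_span(old: bytes, new: bytes) -> int:
    """Bytes of `new` left after stripping the prefix and suffix shared with `old`."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return len(new) - prefix - suffix


# Synthetic, compressible input: 2000 bytes drawn from a four-letter alphabet.
rng = random.Random(0)
original = bytes(rng.choice(b"ACGT") for _ in range(2000))

# Change exactly one byte near the start of the file.
edited = bytearray(original)
edited[100] = ord("A") if original[100] != ord("A") else ord("C")
edited = bytes(edited)

gz_original = zlib.compress(original)
gz_edited = zlib.compress(edited)

raw_span = changed_span(original, edited)
gz_span = changed_span(gz_original, gz_edited)

print(f"uncompressed: {raw_span} of {len(edited)} bytes need replacing")
print(f"compressed:   {gz_span} of {len(gz_edited)} bytes need replacing")
```

A one-byte edit stays a one-byte difference in the uncompressed data, but the compressed streams diverge from the point of the edit onwards, so a patch between them has far less shared material to reuse.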
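I really dislike reinventing the wheel. When it comes to optimizing storage space, people far smarter than me have already worked out solutions, and I'd rather build on their hard work where possible. With that said, I might consider storing the files in a revision control system such as Mercurial or Git, once I understand how they store binary data. Once you have decided which one to use, you can look at writing some stored functions, most likely in PL/Perl or a similar language, that interact with the version control system and bridge the gap between your relational data in PostgreSQL and the binary files.
My only issue with this approach is that it takes a transactional system and introduces an outside system (Mercurial/Git) into it. On top of that, a backup of the database won't back up the Mercurial or Git repository. But there will always be trade-offs, so just figure out which ones you can live with.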
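Since this suggestion hinges on understanding how such systems store binary data, here is a minimal sketch of Git's side of it. Git names each file version (a "blob") by the SHA-1 of a short header plus the content, and writes a new loose object as a zlib-compressed full copy; deltas between similar objects are only computed later, when objects are packed. The function names below are hypothetical, and the sketch assumes a repository using the default SHA-1 object format. It says nothing about Mercurial.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob with this content (SHA-1 object format)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def git_loose_object(content: bytes) -> bytes:
    """Bytes Git would write for a loose blob: zlib-compressed header + content."""
    header = b"blob %d\0" % len(content)
    return zlib.compress(header + content)


revision = b"example binary payload"
print(git_blob_id(revision))
print(len(git_loose_object(revision)), "bytes on disk before packing")
```

Because identical content always hashes to the same ID, an unchanged file is stored once no matter how many revisions refer to it. For data that is already compressed, though, the space saving between differing revisions depends on how well packing can find deltas, which ties back to the point about diffing compressed data.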