What is the most efficient way to store multiple revisions of a binary file in PostgreSQL?

Posted 2024-10-30 05:37:31

I'm looking for a limited form of version control in the database here:

  • Size is of greatest importance: many revisions of the same file should occupy the smallest space possible (I'm not looking for compression since the data is already compressed)
  • Computational requirements are secondary
  • I should be able to fetch the current revision of the document as fast as possible (fetching older versions is not time-critical)

Basically answers should contain at least two things:

  • What binary diff algorithm would you use?
  • How would you structure this system in a way specific to PostgreSQL?

Comments (2)

负佳期 2024-11-06 05:37:31

"Size is of greatest importance": how about calling an external diff tool such as bsdiff, for example through PL/sh (http://plsh.projects.postgresql.org/)?

"I should be able to fetch the current revision of the document as fast as possible": In which case you will want to do your diff the 'wrong' way round, so each revision would involve:

  1. replace the stored 'previous revision' with the diff that turns 'new revision' back into 'previous revision'
  2. add 'new revision', stored in full

To get back to an old revision, you would then start from the current revision and iteratively apply the previous diffs as patches until you reach the revision you need.
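The reverse-delta bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `RevisionStore`, `make_delta` and `apply_delta` are hypothetical names, the store lives in memory rather than in PostgreSQL, and zlib with a preset dictionary stands in for bsdiff (it only sees the last 32 KiB of the base, so it is not a substitute for a real binary diff tool).

```python
import zlib


def make_delta(base: bytes, target: bytes) -> bytes:
    """Encode `target` relative to `base` (stand-in for a bsdiff-style tool)."""
    compressor = zlib.compressobj(level=9, zdict=base)
    return compressor.compress(target) + compressor.flush()


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target from `base` plus a delta made by make_delta."""
    decompressor = zlib.decompressobj(zdict=base)
    return decompressor.decompress(delta) + decompressor.flush()


class RevisionStore:
    """Keeps the newest revision in full and older ones as reverse deltas."""

    def __init__(self, first_revision: bytes):
        self.head = first_revision
        # reverse_deltas[i] rebuilds revision i from revision i + 1
        self.reverse_deltas = []

    def commit(self, new_revision: bytes) -> None:
        # Step 1: replace the full previous revision with a delta against the new one.
        self.reverse_deltas.append(make_delta(new_revision, self.head))
        # Step 2: store the new revision in full.
        self.head = new_revision

    def get(self, number: int) -> bytes:
        # Walk backwards from the head, applying one reverse delta per step.
        data = self.head
        for delta in reversed(self.reverse_deltas[number:]):
            data = apply_delta(data, delta)
        return data


store = RevisionStore(b"report v1 " * 100)
store.commit(b"report v1 " * 100 + b"appendix")
store.commit(b"cover page " + b"report v1 " * 100 + b"appendix")
latest = store.head      # no patching needed for the current revision
oldest = store.get(0)    # two reverse deltas are applied to get here
```

Reading the current revision never touches a delta, which matches the "fetch the current revision as fast as possible" requirement; the cost moves to commits and to reads of old revisions.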

Whatever you do, I think you will need to uncompress the data first before using the diff tool. Here's why:

# myfile.1: 10 KiB of random data, doubled four times (160 KiB, highly compressible)
dd if=/dev/urandom of=myfile.1 bs=1024 count=10
cp myfile.1 tmp; cat tmp >> myfile.1
cp myfile.1 tmp; cat tmp >> myfile.1
cp myfile.1 tmp; cat tmp >> myfile.1
cp myfile.1 tmp; cat tmp >> myfile.1
# myfile.2: a different 10 KiB random block, doubled four times, followed by all of myfile.1
dd if=/dev/urandom of=myfile.2 bs=1024 count=10
cp myfile.2 tmp; cat tmp >> myfile.2
cp myfile.2 tmp; cat tmp >> myfile.2
cp myfile.2 tmp; cat tmp >> myfile.2
cp myfile.2 tmp; cat tmp >> myfile.2
cat myfile.1 >> myfile.2
# diff the uncompressed files
bsdiff myfile.1 myfile.2 diff
# compress both files, then diff the compressed versions
gzip -c myfile.1 > myfile.1.gz
gzip -c myfile.2 > myfile.2.gz
bsdiff myfile.1.gz myfile.2.gz gz.diff
rm tmp
ls -l

-rw-r--r-- 1 root root  17115 2011-04-05 10:54 diff
-rw-r--r-- 1 root root  21580 2011-04-05 10:54 gz.diff
-rw-r--r-- 1 root root 163840 2011-04-05 10:54 myfile.1
-rw-r--r-- 1 root root  11709 2011-04-05 10:54 myfile.1.gz
-rw-r--r-- 1 root root 327680 2011-04-05 10:54 myfile.2
-rw-r--r-- 1 root root  23399 2011-04-05 10:54 myfile.2.gz

Note that gz.diff (21580 bytes) is larger than diff (17115 bytes), even though the gzipped inputs are far smaller than the raw ones. If you try this with real files I expect the difference to be even larger.
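The same effect can be reproduced without bsdiff. The sketch below is an approximation, not the experiment above: `delta_size` and `random_block` are hypothetical helpers, zlib with a preset dictionary is assumed as a stand-in delta encoder, and the files are scaled down so they fit in zlib's 32 KiB window.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable


def random_block(size: int) -> bytes:
    """Return `size` pseudo-random bytes."""
    return bytes(rng.getrandbits(8) for _ in range(size))


def delta_size(base: bytes, target: bytes) -> int:
    """Bytes needed to encode `target` when `base` is already known."""
    compressor = zlib.compressobj(level=9, zdict=base)
    return len(compressor.compress(target) + compressor.flush())


# Same shape as myfile.1 / myfile.2: repeated random blocks, with the
# second file made of new material followed by the whole first file.
old = random_block(2048) * 4
new = random_block(2048) * 4 + old

raw_delta = delta_size(old, new)
gz_delta = delta_size(zlib.compress(old), zlib.compress(new))

print("delta between raw files:       ", raw_delta)
print("delta between compressed files:", gz_delta)
```

Compression scrambles the byte-level similarity between the two files, so the delta between the compressed versions has very little shared content to reuse.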

半步萧音过轻尘 2024-11-06 05:37:31

I tend to really dislike reinventing the wheel. When it comes to storage space optimization, people way smarter than me have already figured out solutions, and I'd prefer, when possible, to leverage their hard work. With that said, I might consider storing my files in a revision control system such as Mercurial or Git, once I understand how they store binary data. Once you figure out which one you want to use, you can look at ways of creating some stored functions, most likely in PL/Perl or a similar language, that can interact with the version control system and bridge the gap between your relational data in PostgreSQL and the binary files.

My only issue with this approach is that I don't really like taking a transactional system and introducing an outside system (Mercurial/Git) into it. On top of that, a backup of the database won't back up my Mercurial or Git repository. But there will always be a trade-off, so just figure out which ones you can live with.
