SVN repository grows disproportionately when small changes to large files are committed

Posted 2024-11-27 08:12:42

I can't figure out why small differences to big files are causing my subversion repository to grow so much.

I have a zip file of the contents of a database used by some tests. I want to store each new version of the test data in our Subversion repository.

I've done some experiments, checking in the last few versions of data.zip and watching what happens to the size of the repository. The uncompressed data is about 150MB; zipped, it is about 50MB. Each new version of data.zip checked into the repository increases the repository's size by about 50MB. I think it should only grow by the size of a delta, which I expect to be much smaller.

Subversion uses xdelta to store compressed difference data. To confirm that SVN could do better, I downloaded xdelta and checked that two versions really don't differ much. Indeed,

xdelta3.0z.x86-64.exe -e -s v1_path\data.zip v2_path\data.zip v1v2_delta.file

produced a v1v2_delta.file which was about 3MB.

I've looked in the SVN repository at [myrepo]\db\revs and can see a large file for each new revision:

02/08/2011  11:12        57,853,082 4189
02/08/2011  11:40        51,713,289 4190
02/08/2011  11:46        52,286,060 4191

(The 4189, 4190 and 4191 are the names of files.)
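The directory listing above can also be gathered with a short script, which makes it easier to track growth across many revisions. This is only a sketch: the function name `rev_file_sizes` is made up for illustration, and it assumes the FSFS layout described in the question, where each revision is a file named after its revision number somewhere under db\revs (possibly inside shard subdirectories).

```python
import os

def rev_file_sizes(revs_dir):
    """Map revision number -> size in bytes of its FSFS rev file.

    Assumes rev files are named with the bare revision number, as in the
    listing above. Walks subdirectories too, in case the repository is sharded.
    """
    sizes = {}
    for root, _dirs, files in os.walk(revs_dir):
        for name in files:
            if name.isdigit():  # skip anything that is not a revision file
                sizes[int(name)] = os.path.getsize(os.path.join(root, name))
    return sizes

def print_sizes(revs_dir):
    # Print one line per revision, oldest first.
    for rev, size in sorted(rev_file_sizes(revs_dir).items()):
        print(f"r{rev}: {size:,} bytes")
```

Calling `print_sizes(r"[myrepo]\db\revs")` with the real repository path substituted should show whether each commit of data.zip adds roughly the full 50MB or only a delta.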

I even tried creating data.zip without compression. This made no difference to what SVN stores. From the look of it, my guess is that it is storing a compressed copy of the entire data.zip for every revision, not just the first. I'm running SVN 1.6 with an FSFS backend.

There are various other good Stack Overflow answers about committing binaries and how SVN stores deltas, e.g. "SVN performance after many revisions". But I cannot see from these why deltas aren't being stored in the above case. If xdelta can get such a small diff running standalone, surely SVN can too. Or is it choosing not to?

Edit: I've also tried uncompressed tar files, and again SVN isn't storing them efficiently. I also found that we have a zip file of the same data format (although much smaller) in a different repository, where SVN has stored only diffs.

So the summarized version of this question is: SVN can store binary files efficiently, e.g. 10 slightly different CAD files take just 1.2 times the space of one. SVN can even be space-efficient with compressed zip files sometimes. But evidently it isn't always space-efficient with binary files. Under what conditions does it fail to be?

Comments (4)

琉璃梦幻 2024-12-04 08:12:43

Summary

Subversion will sometimes do worse than standalone xdelta because of how much memory it gives to the delta compression. As of version 1.6, this is Subversion behaviour that can't be changed.

Details

I asked on the subversion mailing list why the subversion repository files seemed to be bigger than they should be.

The conclusion is that xdelta can produce a smaller delta if you give it more memory.
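The effect of a limited memory window can be illustrated with a loose analogy in Python. This is not Subversion's or xdelta's algorithm: the sketch uses zlib, whose DEFLATE format can only refer back 32KB, and compresses an old version followed by a nearly identical new version. When the file fits inside the window the second copy costs almost nothing; when it is much larger than the window the second copy costs about as much as the first. The function name and sizes are illustrative assumptions.

```python
import random
import zlib

def extra_bytes_for_second_copy(size, seed=0):
    """Compressed cost of appending a near-identical copy of `size` random bytes.

    Random bytes are incompressible on their own, so any saving must come from
    the compressor finding the earlier copy inside its look-back window.
    """
    rng = random.Random(seed)
    v1 = bytes(rng.getrandbits(8) for _ in range(size))
    v2 = bytearray(v1)
    v2[size // 2] ^= 0xFF  # a one-byte edit in the middle of the "new version"
    alone = len(zlib.compress(v1))
    together = len(zlib.compress(v1 + bytes(v2)))
    return together - alone

# 8KB fits inside zlib's 32KB window; 200KB does not.
small = extra_bytes_for_second_copy(8 * 1024)
large = extra_bytes_for_second_copy(200 * 1024)
print(f"8KB file:   second copy costs {small} bytes")
print(f"200KB file: second copy costs {large} bytes")
```

The same kind of mismatch between file size and look-back window is one plausible reading of why a 50MB archive deltas poorly inside Subversion while a much smaller file of the same format deltas well.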

Further back in this thread there is another example of someone who had the same problem.

Credit and thanks for this go to various people on the Subversion mailing lists, both recently and four years ago.

Also having this problem?

If you're analysing disk usage by the subversion repository, understand skip deltas and use this grep DELTA trick to figure out the base being used for the delta.
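For those not on a system with grep, the DELTA trick can be approximated in Python. This sketch assumes, as the trick itself does, that an FSFS rev file marks each delta-stored representation with a line beginning with `DELTA`, optionally followed by the base revision, offset and length; the function name `delta_headers` is made up here. Because rev files contain binary data, an occasional false match is possible.

```python
def delta_headers(rev_path):
    """Return every line in an FSFS rev file that starts with 'DELTA'.

    A bare 'DELTA' line is assumed to mean a delta against an empty base;
    'DELTA <rev> <offset> <length>' is assumed to name the base representation.
    """
    headers = []
    with open(rev_path, "rb") as f:  # binary mode: rev files are not pure text
        for line in f:
            if line.startswith(b"DELTA"):
                headers.append(line.rstrip(b"\r\n").decode("ascii", "replace"))
    return headers
```

Running it on one of the large rev files, for example the file named 4190 from the question, should show which earlier revision, if any, was used as the delta base.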

And assuming, like me, you really do want to store binary files in the repository, here's my guess at some workarounds (none of them very easy!):

  1. Modify the subversion source code and build your own with the xdelta memory window set to be bigger
  2. Do your own xdelta-ing: check the deltas into source control and have some elaborate process for reconstructing the files
  3. Migrate to Git - it's bound to have better compression (wild speculation)

陌若浮生 2024-12-04 08:12:43

I would think that compression completely changes the makeup of the binary file, so svn has to store huge deltas. Changing even a few characters of the contents of a compressed file can change it drastically.

Storing binaries in source control is generally a bad idea and I think you should look for an alternative.

独﹏钓一江月 2024-12-04 08:12:43

A compressed file's binary content can change drastically when files are added to or modified in the archive. It can happen that a change affects only particular elements of the archive and large areas of the compressed file stay the same. However, whether that happens in normal use is a matter of "luck" (there is no real luck involved, of course, but it is rather complex to plan for).

This is quite normal in entropy encoding algorithms, such as Huffman (to name the simplest one), as the frequencies of the symbols change when files are added or modified. If this takes place at the beginning of the archive's contents, it can severely affect the entire content of the file following the change.
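A rough way to see this is to compress two inputs that differ by a single byte and compare the results. The sketch below uses Python's zlib (DEFLATE, which includes Huffman coding) rather than an actual zip archive, and the sample data and helper names are made up for illustration.

```python
import zlib

def differing_positions(a, b):
    """Count byte positions at which a and b differ, plus any length difference."""
    shared = min(len(a), len(b))
    diff = sum(1 for i in range(shared) if a[i] != b[i])
    return diff + abs(len(a) - len(b))

def make_sample(rows=20000):
    # Repetitive, text-like data standing in for a database dump.
    return b"".join(b"row %d value %d\n" % (i, (i * i) % 97) for i in range(rows))

original = make_sample()
edited = b"R" + original[1:]  # change only the very first byte

raw_diff = differing_positions(original, edited)
packed_diff = differing_positions(zlib.compress(original), zlib.compress(edited))
print(f"uncompressed: {raw_diff} byte differs")
print(f"compressed:   {packed_diff} bytes differ")
```

How far the difference spreads depends on the data and the compressor settings, so the printed numbers are best treated as an experiment rather than a guarantee.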

你是年少的欢喜 2024-12-04 08:12:43

Did you use the FSFS filesystem backend? As I recall, it stores a new copy each time (although it may be compressed). Why are you expecting SVN to store diffs of binary files? SVN is a source code control system (meaning text), not a general binary control system (although it doesn't do as badly as it could at storing binaries).
