SVN repository size increases inexplicably from small differences to big files

Posted 2024-11-27 08:12:42

I can't figure out why small differences to big files are causing my subversion repository to grow so much.

I have a zip file of the contents of a database used by some tests. I want to store each new version of the test data in our Subversion repository.

I've done some experiments, checking in the last few versions of data.zip and looking at what happens to the size of the repository. The uncompressed data is about 150MB; zipped, it is about 50MB. Each new version of data.zip checked into the repository increases the repository's size by about 50MB. I expected it to grow only by the size of a delta, which should be much smaller.

Subversion uses xdelta to store compressed difference data. To confirm that SVN could do better, I downloaded xdelta and checked that two versions don't differ much. Indeed,

xdelta3.0z.x86-64.exe -e -s v1_path\data.zip v2_path\data.zip v1v2_delta.file

produced a v1v2_delta.file which was about 3MB.
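
As a rough cross-check that does not need xdelta, the overlap between two versions can be estimated by hashing fixed-size blocks. This is only a sketch: the function names and the 4096-byte block size are assumptions made for illustration, and this is not the algorithm xdelta or SVN uses. It only matches blocks that start at a multiple of the block size in both files, so a single inserted byte shifts everything after it and the result underestimates what a real delta encoder would find. A high fraction suggests a small delta is possible; a low fraction is inconclusive.

```python
import hashlib

BLOCK_SIZE = 4096  # arbitrary block size chosen for this sketch


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return the SHA-1 digest of each consecutive fixed-size block of a file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.append(hashlib.sha1(block).digest())
    return hashes


def shared_block_fraction(old_path, new_path, block_size=BLOCK_SIZE):
    """Fraction of blocks in new_path whose content also occurs as a block in old_path."""
    old = set(block_hashes(old_path, block_size))
    new = block_hashes(new_path, block_size)
    if not new:
        return 0.0
    return sum(1 for digest in new if digest in old) / len(new)
```

For example, `shared_block_fraction(r"v1_path\data.zip", r"v2_path\data.zip")` would report how much of the second version is made of blocks already present in the first.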

I've looked in the SVN repository at [myrepo]\db\revs and can see large files for each new revision:

02/08/2011  11:12        57,853,082 4189
02/08/2011  11:40        51,713,289 4190
02/08/2011  11:46        52,286,060 4191

(The 4189, 4190 and 4191 are the names of files.)
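
The same check can be scripted instead of reading a directory listing by hand. The following is a minimal sketch that assumes an FSFS layout where each revision is a file named after its revision number; the function names are hypothetical. It walks subdirectories as well, in case the revision files sit in numbered shard directories rather than directly under db\revs.

```python
import os


def revision_file_sizes(revs_dir):
    """Return {revision_number: size_in_bytes} for numerically named files under revs_dir."""
    sizes = {}
    for root, _dirs, files in os.walk(revs_dir):
        for name in files:
            if name.isdigit():  # assumed: revision files are named after the revision number
                sizes[int(name)] = os.path.getsize(os.path.join(root, name))
    return sizes


def print_sizes(revs_dir, first, last):
    """Print the on-disk size of each revision file in the range first..last."""
    sizes = revision_file_sizes(revs_dir)
    for rev in range(first, last + 1):
        if rev in sizes:
            print(f"r{rev}: {sizes[rev] / 1_000_000:.1f} MB")
```

Calling `print_sizes(r"[myrepo]\db\revs", 4189, 4191)` with the real repository path substituted would show how much each commit added on disk.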

I even tried zipping the data.zip without compression. This didn't make a difference to what SVN stores - from the look of it, my guess is that it is storing a compressed copy of the entire data.zip for every revision, not just the first. I'm running SVN 1.6 with an FSFS backend.
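
To rule out the possibility that the "uncompressed" archive still contains deflated entries, the compression method of each entry can be inspected with Python's standard `zipfile` module. This is a small sketch; the function name and the labels it returns are chosen for illustration.

```python
import zipfile


def compression_summary(zip_path):
    """Count the entries of a zip archive by compression method."""
    names = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflated"}
    summary = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Fall back to the numeric method code for anything else (bzip2, lzma, ...)
            method = names.get(info.compress_type, f"method {info.compress_type}")
            summary[method] = summary.get(method, 0) + 1
    return summary
```

If every entry is reported as "stored", the archive really is uncompressed, and compression inside the zip can be excluded as the reason the deltas are large.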

There are various other good Stack Overflow answers about committing binaries and how SVN stores deltas, e.g. SVN performance after many revisions. But I cannot see from these why deltas aren't being stored in the above case - i.e. if xdelta can get such a small diff running standalone, surely SVN can too - or is it choosing not to?!

Edit: I've also tried tar (uncompressed) files, and again SVN isn't storing them efficiently. I also found that we have a zip file of the same data format (although much smaller) in a different repository, where SVN has stored just the diffs.

So the summarized version of this question is: SVN can store binary files efficiently, e.g. 10 slightly different CAD files take just 1.2 times the space of one. SVN can even be space efficient with compressed zip files sometimes. But evidently it isn't always space efficient with binary files - under what conditions does it fail to be?
