How do different version control systems handle binary files?
I have heard some claims that SVN handles binary files better than Git/Mercurial. Is this true, and if so, why? As far as I can imagine, no version control system (VCS) can diff and merge changes between two revisions of the same binary resource.
So, aren't all VCSs bad at handling binary files? I am not very familiar with the technical details behind particular VCS implementations, so maybe each has its own pros and cons.
The main pain point is the "distributed" aspect of any DVCS: you are cloning everything (the entire history of every file).
Most binaries are not stored as deltas and do not compress as well as text files, so if you store rapidly evolving binaries you quickly end up with a large repository that becomes cumbersome to move around (push/pull).
For Git, for instance, see "What are the git limits?".
Binaries aren't a good fit for the features a VCS brings (diff, branch, merge) and are better managed in an artifact repository (Nexus, for example).
This is not necessarily the case for a CVCS (centralized VCS), where the repository can play that role and serve as storage for binaries (even if that is not its primary role).
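To make the compression point above concrete, here is a minimal Python sketch. The sample inputs are assumptions chosen for illustration: a highly repetitive source-like string, and random bytes standing in for already-compressed formats such as JPEG or ZIP. Real files fall somewhere in between (an uncompressed bitmap, for example, usually compresses reasonably well).

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample inputs, chosen only for illustration.
text_like = b"int total = compute(values, count);\n" * 2000
# Random bytes stand in for already-compressed binary formats.
binary_like = os.urandom(len(text_like))

text_ratio = compression_ratio(text_like)
binary_ratio = compression_ratio(binary_like)

print(f"text-like ratio:   {text_ratio:.3f}")
print(f"binary-like ratio: {binary_ratio:.3f}")
```

The repetitive text shrinks to a small fraction of its size, while the random data does not shrink at all, which is roughly why a history full of such binaries stays large.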
One clarification about Git and binary files.
Git compresses binary files as well as text files, so Git is not crap at handling binary files, as someone suggested.
Any file that Git adds is compressed into a loose object. It doesn't matter whether it is binary or text. If you have a binary or text file and you commit it, the repository will grow. If you make a minor change to the file and commit again, your repository will grow again by approximately the same amount, depending on the compression ratio.
Then you run git gc. Git will find similarities in the binary or text files and compress them together. You will get good compression if the similarities are large. If, on the other hand, there are no similarities between the files, you will not gain much by compressing them together compared to compressing them individually.
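The effect described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it is not Git's packfile delta format, it uses zlib's preset-dictionary feature as a stand-in for "compressing one revision against another", and the two revisions are made-up 16 KiB random blobs that differ in four bytes.

```python
import os
import zlib


def deflate(data: bytes, dictionary: bytes = b"") -> bytes:
    """Compress data with zlib, optionally primed with a preset dictionary."""
    if dictionary:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


# Two hypothetical revisions of a binary blob that differ in only 4 bytes.
rev1 = os.urandom(16 * 1024)
rev2 = rev1[:8000] + b"\x00\x01\x02\x03" + rev1[8004:]

# Compressed on its own, similar to a loose object.
alone = len(deflate(rev2))
# Compressed with rev1 as a dictionary, a rough stand-in for delta storage.
against_rev1 = len(deflate(rev2, rev1))

print(f"rev2 compressed alone:        {alone} bytes")
print(f"rev2 compressed against rev1: {against_rev1} bytes")
```

Compressed alone, the second revision costs about as much as the first. Compressed against the first revision, it costs very little, because almost all of its content is already known.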
Here is a test with a bit-mapped picture (binary) that I changed a little:
Git and Mercurial both handle binary files with aplomb. They don't corrupt them, and you can check them in and out. The problem is one of size.
Source usually takes up less room than binary files. You might have 100K of source files that build a 100 MB binary. Thus, storing a single build in my repository could cause it to grow to roughly a thousand times its size.
And it's even worse:
Version control systems usually store files via some form of diff format. Let's say I have a file of 100 lines and each line averages about 40 characters. That entire file is 4K in size. If I change a line in that file, and save that change, I'm only adding about 60 bytes to the size of my repository.
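As a rough illustration of that point, here is a sketch using Python's difflib on a made-up 100-line file with 40 characters per line. The file contents and names are hypothetical, and a unified diff is not the storage format any particular VCS uses, so the exact byte count will differ from the estimate above. The order of magnitude is the point.

```python
import difflib

# A hypothetical 100-line file with 40 characters of content per line.
original = [f"line {i:03d} ".ljust(40, "x") + "\n" for i in range(100)]

# Change a single line.
modified = list(original)
modified[49] = "line 049 ".ljust(40, "y") + "\n"

# n=0 drops context lines so that only the changed line is recorded.
diff = list(difflib.unified_diff(original, modified, "a.txt", "b.txt", n=0))

file_size = sum(len(line) for line in original)
diff_size = sum(len(line) for line in diff)

print(f"file size: {file_size} bytes")
print(f"diff size: {diff_size} bytes")
```

The recorded change is a small fraction of the whole file, which is why text histories stay compact.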
Now, let's say I compiled and added that 100 MB file. I make a change in my source (maybe 10K or so in changes), recompile, and store the new binary build. Well, binaries don't usually diff very well, so it's very likely I'm adding another 100 MB to my repository. Do a few builds, and my repository grows to several gigabytes, yet the source portion of my repository is still only a few dozen kilobytes.
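The arithmetic can be sketched in Python as follows. The sizes come from the example above; the number of builds (30) is an assumption, and the model ignores compression, which would shrink the source side further.

```python
KB = 1_000
MB = 1_000_000


def repo_size(initial: int, growth_per_commit: int, commits: int) -> int:
    """Approximate history size if every commit adds a fixed number of bytes."""
    return initial + growth_per_commit * commits


builds = 30  # hypothetical number of rebuilds

# Source: about 100 KB to start, about 10 KB of changes per commit.
source_total = repo_size(100 * KB, 10 * KB, builds)
# Binary: a 100 MB build, assumed to be stored whole again on every rebuild.
binary_total = repo_size(100 * MB, 100 * MB, builds)

print(f"source history: {source_total / MB:.1f} MB")
print(f"binary history: {binary_total / MB:.0f} MB")
```

Under these assumptions the binary history is several thousand times larger than the source history after the same number of commits.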
The problem with Git and Mercurial is that you normally check out the entire repository onto your system. Instead of merely downloading a few dozen kilobytes that can be transferred in a few seconds, I am now downloading several gigabytes of builds along with the few dozen kilobytes of data.
Maybe people say Subversion is better because I can simply check out the version I want and not download the whole repository. However, Subversion doesn't give you an easy way to remove obsolete binaries from your repository, so your repository will grow and grow anyway. I still don't recommend it. Heck, I wouldn't recommend it even if the revision control system does allow you to remove old revisions of obsolete binaries (Perforce, ClearCase, and CVS all do). It just ends up being a big maintenance headache.
Now, this isn't to say you shouldn't store any binary files. For example, if I am making a web page, I probably have some gifs and jpegs that I need. No problem storing those in either Subversion or Git/Mercurial. They're relatively small, and probably change a lot less than my code itself.
What you shouldn't store are built objects. These should be stored in a release repository and fetched as needed. Maven and Ant w/ Ivy do a great job of this. And you can use the Maven repository structure in C, C++, and C# projects too.
In Subversion you can lock binary files to make sure that nobody else can edit them. This mostly assures you that nobody else will modify that binary file while you have it locked. Distributed VCSs don't (and can't) have locks--there's no central repository for them to be registered at.
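To show why locking leans on a single central place, here is a toy Python model of a server-side lock table. Everything in it (the LockRegistry class, its methods, the file and user names) is hypothetical and is not how Subversion implements locks; in practice you would use the svn lock and svn unlock commands.

```python
class LockRegistry:
    """Toy model of a central server's lock table (hypothetical)."""

    def __init__(self) -> None:
        self._owners = {}  # maps file path -> user holding the lock

    def lock(self, path: str, user: str) -> bool:
        """Grant the lock if the path is free or already held by this user."""
        owner = self._owners.setdefault(path, user)
        return owner == user

    def unlock(self, path: str, user: str) -> bool:
        """Release the lock only if this user holds it."""
        if self._owners.get(path) == user:
            del self._owners[path]
            return True
        return False

    def can_commit(self, path: str, user: str) -> bool:
        """A commit is allowed if the path is unlocked or locked by this user."""
        return self._owners.get(path, user) == user


server = LockRegistry()
alice_ok = server.lock("logo.psd", "alice")  # alice takes the lock
bob_ok = server.lock("logo.psd", "bob")      # bob is refused
print(alice_ok, bob_ok, server.can_commit("logo.psd", "bob"))
```

The table only works because every client asks the same server. With independent clones there is no single table for everyone to consult.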
Text files have a natural line-oriented structure that binary files lack. This is why it's harder to compare binary files using common text tools (diff). While it should be possible, the advantage of readability (the reason we use text as our preferred format in the first place) would be lost when applying diffs to binary files.
As to your suggestion that all version control systems "are crap at handling binary files", I don't know. In principle, there's no reason why a binary file should be slower to process. I would rather say that the advantages of using a VCS (tracking, diffs, overview) are more apparent when handling text files.