How is merging in Git better than in SVN?
I've heard in a few places that one of the main reasons why distributed version control systems shine is much better merging than in traditional tools like SVN.
Is this actually due to inherent differences in how the two systems work, or do specific DVCS implementations like Git/Mercurial just have cleverer merging algorithms than SVN?
The claim that merging is better in a DVCS than in Subversion was largely based on how branching and merging worked in Subversion a while ago. Subversion prior to 1.5.0 didn't store any information about when branches were merged, so when you wanted to merge you had to specify the range of revisions to be merged.
So why did Subversion merges suck?
Ponder this example:
When we want to merge b1's changes into the trunk we'd issue the following command, while standing on a folder that has trunk checked out:
… which will attempt to merge the changes from b1 into your local working directory. You then commit the changes after you have resolved any conflicts and tested the result. When you commit, the revision tree would look like this:
However, this way of specifying ranges of revisions quickly gets out of hand as the version tree grows, because Subversion didn't have any metadata on when and which revisions got merged together. Ponder what happens later:
This is largely an issue with Subversion's repository design: in order to create a branch you need to create a new virtual directory in the repository, which will house a copy of the trunk, but it doesn't store any information about when and what got merged back in. That leads to nasty merge conflicts at times. What was even worse is that Subversion used two-way merging by default, which has some crippling limitations in automatic merging because the two branch heads are not compared with their common ancestor.
To mitigate this, Subversion now stores metadata for branches and merges. That would solve all problems, right?
And oh, by the way, Subversion still sucks…
On a centralized system like Subversion, virtual directories suck. Why? Because everyone has access to view them… even the garbage experimental ones. Branching is good if you want to experiment, but you don't want to see everyone's and their aunt's experimentation. This is serious cognitive noise. The more branches you add, the more crap you'll get to see.
The more public branches you have in a repository, the harder it will be to keep track of all the different branches. So the question you'll have is whether a branch is still in development or whether it is really dead, which is hard to tell in any centralized version control system.
Most of the time, from what I've seen, an organization will default to using one big branch anyway. That is a shame, because it in turn makes it difficult to keep track of testing and release versions, and whatever else good comes from branching.
So why are DVCS, such as Git, Mercurial and Bazaar, better than Subversion at branching and merging?
There is a very simple reason why: branching is a first-class concept. There are no virtual directories by design, and branches are hard objects in a DVCS, which they need to be in order for synchronization of repositories (i.e. push and pull) to work simply.
The first thing you do when you work with a DVCS is to clone repositories (git's clone, hg's clone and bzr's branch). Cloning is conceptually the same thing as creating a branch in version control. Some call this forking or branching (although the latter is often also used to refer to co-located branches), but it's just the same thing. Every user runs their own repository, which means you have per-user branching going on.
The version structure is not a tree, but rather a graph. More specifically, a directed acyclic graph (DAG, meaning a graph that doesn't have any cycles). You really don't need to delve into the specifics of a DAG other than that each commit has one or more parent references (which are what the commit was based on). So the following graphs will show the arrows between revisions in reverse because of this.
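The parent-reference idea can be sketched in a few lines. This is only an illustrative model, not how any of these tools store history: the Commit class, the ancestors helper and the revision ids are all made up for the example, and the root commit is modeled with an empty parent tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A revision that points back at the revision(s) it was based on."""
    rev_id: str
    parents: tuple = ()  # empty for the root, two entries for a merge commit


def ancestors(commit):
    """Return the ids of every revision reachable through parent links."""
    seen = set()
    stack = list(commit.parents)
    while stack:
        current = stack.pop()
        if current.rev_id not in seen:
            seen.add(current.rev_id)
            stack.extend(current.parents)
    return seen


# Hypothetical history: a -> b, then two lines of work (c and d) merged in m.
a = Commit("a")
b = Commit("b", (a,))
c = Commit("c", (b,))
d = Commit("d", (b,))
m = Commit("m", (c, d))

print(sorted(ancestors(m)))
```

Because every commit only knows its parents, walking the history always goes backwards in time, which is why the arrows in such graphs point from newer revisions to older ones.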
A very simple example of merging would be this: imagine a central repository called origin and a user, Alice, cloning the repository to her machine.
What happens during a clone is that every revision is copied to Alice exactly as it was (which is validated by the uniquely identifiable hash IDs), and the clone marks where origin's branches are at.
Alice then works on her repo, committing in her own repository and decides to push her changes:
The solution is rather simple: the only thing that the origin repository needs to do is to take in all the new revisions and move its branch to the newest revision (which git calls "fast-forward"):
The use case I illustrated above doesn't even need to merge anything. So the issue really isn't with merging algorithms, since the three-way merge algorithm is pretty much the same between all version control systems. The issue is more about structure than anything.
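The fast-forward case comes down to an ancestry check, which might be sketched roughly like this. The function names, the dictionary layout and the revision ids (r1..r3, a1, a2) are assumptions made for illustration; real tools do the same kind of reachability test on their own object stores.

```python
def is_ancestor(parents, older, newer):
    """True if `older` is reachable from `newer` by following parent links."""
    stack = [newer]
    seen = set()
    while stack:
        rev = stack.pop()
        if rev == older:
            return True
        if rev not in seen:
            seen.add(rev)
            stack.extend(parents.get(rev, []))
    return False


def try_push(parents, remote_tip, pushed_tip):
    """Return the new remote tip, or raise if the push is not a fast-forward."""
    if not is_ancestor(parents, remote_tip, pushed_tip):
        raise ValueError("rejected: pushed tip does not descend from remote tip")
    return pushed_tip


# Hypothetical revision ids: origin holds r1..r3, Alice adds a1 and a2 on top.
parents = {"r1": [], "r2": ["r1"], "r3": ["r2"], "a1": ["r3"], "a2": ["a1"]}
origin_tip = try_push(parents, "r3", "a2")
print(origin_tip)
```

Nothing is merged here: the remote tip simply moves forward. A pushed revision that does not descend from the current remote tip would raise instead, and that is the situation in which a real merge is needed.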
So how about you show me an example that has a real merge?
Admittedly the above example is a very simple use case, so let's do a much more twisted one, albeit a more common one. Remember that origin started out with three revisions? Well, the guy who did them, let's call him Bob, has been working on his own and made a commit in his own repository:
Now Bob can't push his changes directly to the origin repository. The system detects this by checking whether Bob's revisions descend directly from origin's, which in this case they don't. Any attempt to push will result in the system saying something akin to "Uh... I'm afraid I can't let you do that, Bob."
So Bob has to pull in and then merge the changes (with git's pull; or hg's pull and merge; or bzr's merge). This is a two-step process. First Bob has to fetch the new revisions, which will copy them as they are from the origin repository. We can now see that the graph diverges:
The second step of the pull process is to merge the diverging tips and make a commit of the result:
Hopefully the merge won't run into conflicts (if you anticipate them you can do the two steps manually in git with fetch and merge). What needs to be done afterwards is to push those changes to origin again, which will result in a fast-forward merge since the merge commit is a direct descendant of the latest revision in the origin repository:
There is another option to merge in git and hg, called rebase, which will move Bob's changes to after the newest changes. Since I don't want this answer to be any more verbose, I'll let you read the git, mercurial or bazaar docs about that instead.
As an exercise for the reader, try drawing out how it will work out with another user involved. It is done in the same way as the example above with Bob. Merging between repositories is easier than you'd think, because all the revisions/commits are uniquely identifiable.
There is also the issue of sending patches between developers, which was a huge problem in Subversion and is mitigated in git, hg and bzr by uniquely identifiable revisions. Once someone has merged his changes (i.e. made a merge commit) and sends the result for everyone else in the team to consume, either by pushing to a central repository or by sending patches, the others don't have to worry about the merge, because it has already happened. Martin Fowler calls this way of working promiscuous integration.
Because the structure is different from Subversion's, employing a DAG instead, branching and merging become easier not only for the system but for the user as well.
Historically, Subversion has only been able to perform a straight two-way merge because it didn't store any merge information. This involves taking a set of changes and applying them to a tree. Even with merge information, this is still the most commonly used merge strategy.
Git uses a 3-way merge algorithm by default, which involves finding a common ancestor to the heads being merged and making use of the knowledge that exists on both sides of the merge. This allows Git to be more intelligent in avoiding conflicts.
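To make the role of the common ancestor concrete, here is a minimal sketch of a three-way merge at whole-file granularity. It is a toy under stated assumptions: snapshots are plain dictionaries mapping a path to its content, the file names are invented, and real tools also apply the same rule to regions inside a file rather than only to whole files.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots (path -> content) against their common ancestor.

    Returns (merged, conflicts). A path conflicts only when both sides
    changed it relative to the base, and changed it differently.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including both deleting the path)
            result = o
        elif o == b:      # only their side changed this path
            result = t
        elif t == b:      # only our side changed this path
            result = o
        else:             # both sides changed it, in different ways
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


# Hypothetical snapshots of the common ancestor and the two heads.
base = {"README": "v1", "main.c": "v1", "util.c": "v1"}
ours = {"README": "v2", "main.c": "v1", "util.c": "ours"}
theirs = {"README": "v1", "main.c": "v1", "util.c": "theirs", "new.c": "v1"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

Comparing only the two heads would show that README differs without saying which side changed it. With the base available, the merge can see that only one side touched README and take that version without asking, leaving util.c as the only real conflict.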
Git also has some sophisticated rename finding code, which also helps. It doesn't store changesets or store any tracking information -- it just stores the state of the files at each commit and uses heuristics to locate renames and code movements as required (the on-disk storage is more complicated than this, but the interface it presents to the logic layer exposes no tracking).
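The idea of finding renames after the fact, from stored file states alone, can be illustrated with a similarity heuristic. This sketch is not Git's algorithm: detect_renames, the 0.5 threshold and the use of difflib's similarity ratio are assumptions chosen to keep the example short.

```python
from difflib import SequenceMatcher


def detect_renames(old, new, threshold=0.5):
    """Pair deleted paths with added paths whose content is similar enough."""
    deleted = {p: c for p, c in old.items() if p not in new}
    added = {p: c for p, c in new.items() if p not in old}
    renames = {}
    for old_path, old_content in deleted.items():
        best_path, best_score = None, threshold
        for new_path, new_content in added.items():
            if new_path in renames.values():
                continue  # already claimed by another deleted path
            score = SequenceMatcher(None, old_content, new_content).ratio()
            if score >= best_score:
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = best_path
    return renames


# Hypothetical snapshots: util.py disappears and helpers.py appears.
old = {"util.py": "def add(a, b):\n    return a + b\n", "main.py": "print('hi')\n"}
new = {"helpers.py": "def add(a, b):\n    return a + b\n", "main.py": "print('hi')\n"}

print(detect_renames(old, new))
```

No rename was ever recorded in either snapshot; the pairing is inferred purely by comparing the content that vanished with the content that appeared.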
Put simply, the merge implementation is done better in Git than in SVN. Before 1.5, SVN did not record a merge action, so it was incapable of doing future merges without help from the user, who needed to provide information that SVN did not record. With 1.5 it got better, and indeed the SVN storage model is slightly more capable than Git's DAG. But SVN stored the merge information in a rather convoluted form that makes merges take massively more time than in Git. I've observed factors of 300 in execution time.
Also, SVN claims to track renames to aid merges of moved files. But actually it still stores them as a copy and a separate delete action, and the merge algorithm still stumbles over them in modify/rename situations, that is, where a file is modified on one branch and renamed on the other, and those branches are to be merged. Such situations will still produce spurious merge conflicts, and in the case of directory renames it even leads to silent loss of modifications. (The SVN people then tend to point out that the modifications are still in the history, but that doesn't help much when they aren't in a merge result where they should appear.)
Git, on the other hand, does not even track renames but figures them out after the fact (at merge time), and does so pretty magically.
The SVN merge representation also has issues; in 1.5/1.6 you could merge from trunk to branch as often as you liked, automatically, but a merge in the other direction needed to be announced (--reintegrate), and left the branch in an unusable state. Much later they found out that this actually isn't the case, and that a) the --reintegrate can be figured out automatically, and b) repeated merges in both directions are possible.
But after all this (which IMHO shows a lack of understanding of what they are doing), I'd be (OK, I am) very cautious about using SVN in any nontrivial branching scenario, and would ideally try to see what Git thinks of the merge result.
Other points made in the answers, such as the forced global visibility of branches in SVN, aren't relevant to merge capabilities (but are relevant to usability). Also, the 'Git stores changes while SVN stores (something different)' claims are mostly beside the point. Git conceptually stores each commit as a separate tree (like a tar file), and then uses quite a few heuristics to store that efficiently. Computing the changes between two commits is separate from the storage implementation. What is true is that Git stores the history DAG in a much more straightforward form than SVN does its mergeinfo. Anyone trying to understand the latter will know what I mean.
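The separation between storing snapshots and computing changes can be shown with a small sketch. The store layout, the commit ids and diff_snapshots are hypothetical; the only point is that nothing resembling a delta is stored, and the change set is derived on demand.

```python
def diff_snapshots(old, new):
    """Derive the change between two stored snapshots (path -> content)."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and old[p] != new[p]),
    }


# Hypothetical store: each commit id maps to a full snapshot, not to a delta.
snapshots = {
    "c1": {"README": "hello", "main.py": "print(1)"},
    "c2": {"README": "hello", "main.py": "print(2)", "LICENSE": "MIT"},
}

changes = diff_snapshots(snapshots["c1"], snapshots["c2"])
print(changes)
```

Any two snapshots can be compared this way, not only a commit and its parent, which is what lets a merge compare each head against a common ancestor chosen at merge time.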
In a nutshell: Git uses a much simpler data model to store revisions than SVN, and thus it could put a lot of energy into the actual merge algorithms rather than trying to cope with the representation => practically better merging.
One thing that hasn't been mentioned in the other answers, and that really is a big advantage of a DVCS, is that you can commit locally before you push your changes. In SVN, when I had some change I wanted to check in, and someone had already done a commit on the same branch in the meantime, this meant that I had to do an svn update before I could commit. This means that my changes and the changes from the other person are now mixed together, and there is no way to abort the merge (like with git reset or hg update -C), because there is no commit to go back to. If the merge is non-trivial, this means that you can't continue to work on your feature before you have cleaned up the merge result.
But then, maybe that is only an advantage for people who are too dumb to use separate branches (if I remember correctly, we had only one branch that was used for development back in the company where I used SVN).
EDIT: This is primarily addressing this part of the question:
Is this actually due to inherent differences in how the two systems work, or do specific DVCS implementations like Git/Mercurial just have cleverer merging algorithms than SVN?
TL;DR - Those specific tools have better algorithms. Being distributed has some workflow benefits, but is orthogonal to the merging advantages.
END EDIT
I read the accepted answer. It's just plain wrong.
SVN merging can be a pain, and it can also be cumbersome. But, ignore how it actually works for a minute. There is no information that Git keeps or can derive that SVN doesn't also keep or can derive. More importantly, there is no reason why keeping separate (sometimes partial) copies of the version control system will provide you with more actual information. The two structures are completely equivalent.
Assume you want to do "some clever thing" Git is "better at", and your thing is checked into SVN.
Convert your SVN into the equivalent Git form, do it in Git, and then check the result in, perhaps using multiple commits, some extra branches. If you can imagine an automated way to turn an SVN problem into a Git problem, then Git has no fundamental advantage.
At the end of the day, any version control system will let me
Additionally, for merging it's also useful (or critical) to know
Mercurial, Git and Subversion (now natively, previously using svnmerge.py) can all provide all three pieces of information. In order to demonstrate something fundamentally better with DVC, please point out some fourth piece of information which is available in Git/Mercurial/DVC but not available in SVN / centralized VC.
That's not to say they're not better tools!
SVN tracks files while Git tracks content changes. It is clever enough to track a block of code that was refactored from one class/file to another. They use two completely different approaches to tracking your source.
I still use SVN heavily, but I am very pleased with the few times I've used Git.
A nice read if you have the time: Why I chose Git
I just read an article on Joel's blog (sadly his last one). This one is about Mercurial, but it actually talks about the advantages of distributed VC systems such as Git.
Read the article here.