Algorithms for a source control system?
I need to write a simple source control system and wonder what algorithm I would use for file differences?
I don't want to look into existing source code due to license concerns. I need to have it licensed under MPL so I can't look at any of the existing systems like CVS or Mercurial as they are all GPL licensed.
Just to give some background, I just need some really simple functions - binary files in a folder. No subfolders, and every file behaves like its own repository. No metadata except for some permissions.
Overall, really simple stuff. My single concern is how to store only the differences of a file from revision to revision without wasting too much space, but also without being too inefficient (maybe store a full version every X changes, a bit like keyframes in video?).
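To make the keyframe idea concrete, here is a minimal sketch of one way it could look, using Python's difflib opcodes as a stand-in delta format. The KEYFRAME_INTERVAL constant, the KeyframeHistory class, and the opcode layout are illustrative assumptions, not something from the question.

```python
import difflib

KEYFRAME_INTERVAL = 16   # hypothetical choice: store a full copy every 16 revisions

def forward_delta(old: bytes, new: bytes):
    """Delta that rebuilds `new` from `old`, stored as simplified difflib opcodes."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))  # literal bytes from the new revision
        # "delete": bytes only present in `old` are simply skipped
    return ops

def apply_forward_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

class KeyframeHistory:
    """History of one file: a full snapshot every KEYFRAME_INTERVAL revisions
    (the "keyframes"), forward deltas against the previous revision in between."""
    def __init__(self):
        self.revisions = []                   # ("full", bytes) or ("delta", ops)

    def commit(self, content: bytes):
        if len(self.revisions) % KEYFRAME_INTERVAL == 0:
            self.revisions.append(("full", content))
        else:
            previous = self.checkout(len(self.revisions) - 1)
            self.revisions.append(("delta", forward_delta(previous, content)))

    def checkout(self, rev: int) -> bytes:
        base = rev - rev % KEYFRAME_INTERVAL  # nearest keyframe at or before rev
        content = self.revisions[base][1]
        for _, ops in self.revisions[base + 1:rev + 1]:
            content = apply_forward_delta(content, ops)
        return content
```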
Comments (7)
Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.
"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.
Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.
Having said that, many modern revision control systems store complete files at each stage, and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas might be ultimately more efficient.
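For illustration, a sketch of just the anchor-selection step that gives patience diff its name, following the usual description of the algorithm: pair up lines that occur exactly once in each file, then keep the largest subset that appears in the same order in both (a longest increasing subsequence, found by patience sorting). The function name `patience_anchors` is an illustrative choice.

```python
from bisect import bisect_left
from collections import Counter

def patience_anchors(a_lines, b_lines):
    """Anchor selection as in patience diff: pair up lines that occur exactly once
    in each file, then keep the largest subset appearing in the same order in both
    files (a longest increasing subsequence, found via patience sorting)."""
    count_a, count_b = Counter(a_lines), Counter(b_lines)
    pos_in_b = {line: j for j, line in enumerate(b_lines) if count_b[line] == 1}
    pairs = [(i, pos_in_b[line]) for i, line in enumerate(a_lines)
             if count_a[line] == 1 and line in pos_in_b]

    pile_tops, backlinks = [], []        # pile_tops[k]: index into pairs of the top of pile k
    for idx, (_, bj) in enumerate(pairs):
        k = bisect_left([pairs[p][1] for p in pile_tops], bj)
        backlinks.append(pile_tops[k - 1] if k > 0 else -1)
        if k == len(pile_tops):
            pile_tops.append(idx)
        else:
            pile_tops[k] = idx

    anchors, p = [], pile_tops[-1] if pile_tops else -1
    while p != -1:                       # walk back from the top of the last pile
        anchors.append(pairs[p])
        p = backlinks[p]
    return anchors[::-1]                 # (index_in_a, index_in_b) pairs, in order
```

A full implementation recurses on the regions between consecutive anchors and falls back to an ordinary diff where no unique lines exist.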
How about looking at the source code of Subversion? It's licensed under the Apache License 2.0.
Gene Myers has written a good paper, An O(ND) Difference Algorithm and Its Variations. When it comes to comparing sequences, Myers is the man. You should probably also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.
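For reference, a compact sketch of the greedy forward search from that paper. It computes only D, the length of the shortest edit script; recovering the actual script requires keeping the v array from each round (or the linear-space refinement described in the paper).

```python
def myers_shortest_edit(a, b):
    """Greedy forward search from Myers' O(ND) paper. Returns D, the length of
    the shortest edit script (insertions + deletions) turning sequence a into b."""
    n, m = len(a), len(b)
    v = {1: 0}                       # v[k]: furthest-reaching x on diagonal k = x - y
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]         # step down: insert b[y]
            else:
                x = v[k - 1] + 1     # step right: delete a[x]
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x += 1               # follow the free diagonal "snake"
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m                     # unreachable; the loop always terminates earlier
```

For example, `myers_shortest_edit("ABCABBA", "CBABAC")` returns 5, matching the running example in the paper.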
The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"
Lots of source control systems store deltas as computed essentially by "diff", e.g., the line-oriented complement of longest common subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.
For programming language source code, one can compute Levenshtein distances over program structures. A set of tools for doing essentially this for a variety of popular programming languages can be found at Smart Differencer.
If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.
Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).
Though Fossil is GPL, the delta algorithm is based on rsync and described here.
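For illustration, a naive sketch of the block-matching idea behind rsync-style deltas: index fixed-size blocks of the old version, then encode the new version as COPY references into the old file plus LITERAL runs. Real implementations (rsync, and Fossil's delta format) use a weak rolling checksum plus a strong hash so every window is checked in O(1); the slicing, the 64-byte block size, and the op tuples here are just placeholders.

```python
def make_delta(old: bytes, new: bytes, block_size: int = 64):
    """Naive block-matching delta: COPY references to block_size-aligned blocks of
    `old`, LITERAL runs for everything else. (rsync and Fossil use a rolling
    checksum here so each window is hashed in O(1); this sketch just slices.)"""
    index = {}
    for off in range(0, len(old) - block_size + 1, block_size):
        index.setdefault(old[off:off + block_size], off)

    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        block = new[i:i + block_size]
        src = index.get(block) if len(block) == block_size else None
        if src is not None:
            if literal:
                ops.append(("LITERAL", bytes(literal)))
                literal = bytearray()
            ops.append(("COPY", src, block_size))
            i += block_size
        else:
            literal.append(new[i])
            i += 1
    if literal:
        ops.append(("LITERAL", bytes(literal)))
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[1] + op[2]] if op[0] == "COPY" else op[1]
    return bytes(out)
```

Round-tripping `apply_delta(old, make_delta(old, new)) == new` holds by construction, since every byte of the new file ends up either in a matched block or in a literal run.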
I was actually thinking about something similar to this the other day... (odd, huh?)
I don't have a great answer for you, but I did come to the conclusion that if I were to write a file diff tool, I would do so with an algorithm (for finding diffs) that works somewhat like the greedy matching of regexes.
As for storing DIFFs... if I were you, instead of storing forward-facing DIFFs (i.e. you start with your original file and then compute 150 diffs against it when you're working with version 151), use stored DIFFs for your history but store your latest file as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.
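Here is a sketch of that layout for a single file, assuming a reverse-delta format built from Python's difflib opcodes (an illustrative choice, not what the answer prescribes): the newest revision is kept in full, and each older revision is rebuilt by applying reverse deltas backwards from it.

```python
from difflib import SequenceMatcher

def reverse_delta(new: bytes, old: bytes):
    """Delta that rebuilds `old` from `new`: 'keep' ops reference byte ranges of
    the newer revision, 'data' ops carry literal bytes of the older one."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, new, old, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("keep", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", old[j1:j2]))
        # "delete": bytes only present in the newer revision are simply skipped
    return ops

def apply_reverse_delta(new: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        out += new[op[1]:op[2]] if op[0] == "keep" else op[1]
    return bytes(out)

class SingleFileHistory:
    """Newest revision stored in full; every older revision stored as a reverse
    delta against the revision that replaced it."""
    def __init__(self, initial: bytes):
        self.latest = initial
        self.deltas = []                     # deltas[k] rebuilds revision k from revision k + 1

    def commit(self, content: bytes):
        self.deltas.append(reverse_delta(content, self.latest))
        self.latest = content

    def checkout(self, rev: int) -> bytes:
        content = self.latest                # revision len(self.deltas) is the latest
        for delta in reversed(self.deltas[rev:]):
            content = apply_reverse_delta(content, delta)
        return content
```

Reading the newest revision is then free, which matches the "99% of the time" case above; for large binary files you would swap SequenceMatcher for a block-based delta like the rsync-style sketch in the earlier answer.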