How does Git save space and still stay fast?

I just saw the first Git tutorial at http://blip.tv/play/Aeu2CAI.

How does Git store all the versions of all the files, and how can it still be more economical in space than Subversion, which saves only the latest version of the code?

I know this could be done using compression, but compression usually comes at the cost of speed. Yet the tutorial also says that Git is much faster (though where it gains the most is the fact that most of its operations are offline).

So, my guess is that

  • Git compresses data extensively
  • It is still faster because decompression + work is still faster than network_fetch + work

Am I correct? Even close?


2 Answers

路弥 2024-09-09 07:34:50

I assume you are asking how it is possible for a git clone (full repository + checkout) to be smaller than checked-out sources in Subversion. Or did you mean something else?

This question is answered in the comments


Repository size

First, you should take into account that alongside the checkout (the working version), Subversion stores a pristine copy (the latest version) in those .svn subdirectories. The pristine copy is stored uncompressed in Subversion.

Second, Git uses the following techniques to make the repository smaller (a short sketch after this list shows how to observe them):

  • each version of a file is stored only once; this means that if you have only two different versions of some file across 10 revisions (10 commits), Git stores only those two versions, not 10.
  • objects (and deltas, see below) are stored compressed; the text files used in programming compress really well (to around 60% of original size, i.e. a 40% reduction from compression)
  • after repacking, objects are stored in deltified form, as a difference from some other version; additionally, Git tries to order delta chains in such a way that each delta consists mainly of deletions (in the usual case of growing files, that is recency order); IIRC the deltas are compressed as well.
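A quick way to see these techniques at work in any local repository (all standard Git commands; the pack-*.idx glob matches whatever the repack produced):

    # Count loose vs. packed objects and their on-disk size.
    git count-objects -v

    # Force a full repack so all objects end up in packfiles.
    git gc

    # For each packed object, show its type, its size in the pack, its
    # delta depth, and (for deltified objects) the base object.
    git verify-pack -v .git/objects/pack/pack-*.idx | head -20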

Performance (speed of operations)

First, any operation that involves the network will be much slower than a local one. Comparing the current state of the working area with some other version, for example, or getting a log (the history), involves a network connection and transfer in Subversion but is a local operation in Git, so it is of course much slower in Subversion than in Git. By the way, this is the difference between centralized version control systems (which use a client-server workflow) and distributed version control systems (which use a peer-to-peer workflow), not only between Subversion and Git.

Second, if I understand it correctly, nowadays the limitation is usually not the CPU but IO (disk access). Therefore it is possible that the gain from having to read less data from disk thanks to compression (and being able to mmap it into memory) outweighs the cost of having to decompress that data.
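A back-of-the-envelope way to test that tradeoff yourself (big.txt is a hypothetical large text file; gzip stands in for Git's zlib compression):

    # Produce a compressed copy, keeping the original.
    gzip -c big.txt > big.txt.gz

    # Reading less data and decompressing it...
    time gzip -dc big.txt.gz > /dev/null

    # ...versus reading the full uncompressed file.
    time cat big.txt > /dev/null

On an IO-bound machine the first read can win despite the extra CPU work, which is exactly the effect described above.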

Third, Git was designed with performance in mind (see e.g. the GitHistory page on the Git Wiki):

  • The index stores stat information for files, and Git uses it to decide whether files were modified without examining their contents (see e.g. the core.trustctime config variable, and the sketch after this list).
  • The maximum delta depth is limited by pack.depth, which defaults to 50. Git has a delta cache to speed up access. There is a (generated) packfile index for fast access to objects in a packfile.
  • Git takes care not to touch files it doesn't have to. For example, when switching branches or rewinding to another version, Git updates only the files that changed. A consequence of this philosophy is that Git supports only very minimal keyword expansion (at least out of the box).
  • Git uses its own version of the LibXDiff library, nowadays also for diff and merge, instead of calling an external diff / external merge tool.
  • Git tries to minimize latency, which means good perceived performance. For example, it outputs the first page of "git log" as fast as possible, and you see it almost immediately, even if generating the full history would take more time; it doesn't wait for the full history to be generated before displaying it.
  • When fetching new changes, Git checks which objects you have in common with the server, and sends only the (compressed) differences in the form of a thin packfile. Admittedly, Subversion can also (or perhaps does by default) send only differences when updating.
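A hedged illustration of the first two points (the commands and config keys below are standard Git):

    # Dump the stat data (ctime, mtime, dev, ino, size) the index caches
    # per file; Git compares it against lstat() to skip unchanged files.
    git ls-files --debug | head -12

    # Ignore ctime when detecting changes (useful if other tools touch it).
    git config core.trustctime false

    # Repack with explicit delta-chain limits; --depth mirrors pack.depth.
    git repack -a -d --depth=50 --window=10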

I am not a Git hacker, and I have probably missed some of the techniques and tricks Git uses for better performance. Note, however, that Git heavily uses POSIX features (like memory-mapped files) for this, so the gain might not be as large on MS Windows.

衣神在巴黎 2024-09-09 07:34:50

Not a complete answer, but those comments (from AlBlue) might help with the space-management aspect of the question:

There are a couple of things worth clarifying here.

Firstly, it is possible to have a bigger Git repository than an SVN repository; I hope I didn't imply that that was never the case. However, in practice, a Git repository generally takes less space on disk than an equivalent SVN repository would.
One thing you cite is Apache's single SVN repository, which is obviously massive. However, one only has to look at git.apache.org to notice that each Apache project has its own Git repository. What's really needed is a like-for-like comparison; in other words, a checkout of the (abdera) SVN project vs. a clone of the (abdera) Git repository.

I was able to clone git://git.apache.org/abdera.git. On disk, it consumed 28.8 MB.
I then checked out the SVN version, http://svn.apache.org/repos/asf/abdera/java/trunk/, and it consumed 34.3 MB.
Both numbers were taken on a separately mounted RAM-backed partition, and the quoted figure is the number of bytes taken on disk.
Using du -sh as the means of testing, the Git checkout was 11 MB and the SVN checkout was 17 MB.
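For reference, a sketch of how such a comparison can be reproduced (the URLs are the ones quoted above; the target directory names are illustrative):

    # Clone the Git repository: full history plus a working tree.
    git clone git://git.apache.org/abdera.git abdera-git

    # Check out only the latest revision from Subversion.
    svn checkout http://svn.apache.org/repos/asf/abdera/java/trunk/ abdera-svn

    # Compare the on-disk usage of the two directories.
    du -sh abdera-git abdera-svn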

The Git version of Apache Abdera lets me work with any version in the history, up to and including the current release; the SVN checkout only has the pristine copy of the currently checked-out version. Yet the Git clone takes less space on disk.

How, you may ask?

Well, for one thing, SVN creates a lot more files. The SVN checkout has 2959 files; the corresponding Git repository has 845 files.

Secondly, whilst SVN has an .svn folder at each level of the hierarchy, a Git repository only has a single .git directory at the top level. This means (amongst other things) that renaming one directory to another has a relatively smaller impact in Git than in SVN, which admittedly already has a relatively small impact anyway.

Thirdly, Git stores its data as compressed objects, whereas SVN stores them as uncompressed copies. Go into any .svn/text-base directory and you'll find uncompressed copies of the (base) files.
Git has a mechanism to compress all files (and indeed, all history) into pack files. In Abdera's case, .git/objects/pack/ holds a single 4.8 MB .pack file containing all history.
So in this case the size of the repository is (roughly) the same as the currently checked-out code, though I wouldn't expect that to always be the case.
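That contrast is easy to see directly (paths are illustrative; .svn/text-base applies to pre-1.7 SVN working copies, and README stands for any tracked file):

    # SVN: the pristine copies are plain, uncompressed files.
    ls -lh .svn/text-base/

    # Git: history lives in zlib-compressed packfiles...
    ls -lh .git/objects/pack/

    # ...and objects are decompressed transparently on access.
    git cat-file -p HEAD:README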

Anyway, you're right that the history can grow to more than the total size of the current checkout; but because of the way SVN works, it really has to approach twice that size before it makes much of a difference. Even then, reduced disk space isn't really the main reason to use a DVCS anyway; it's an advantage for some things, sure, but it's not the real reason people use one.

Note that Git (and Hg, and other DVCSs) do suffer from a problem where (large) binaries are checked in and then deleted: they will still show up in the repository and take up space, even though they're no longer current. Text compression takes care of this kind of thing for text files, but binaries are more of an issue. (There are administrative commands that can rewrite the contents of Git repositories, but they have a slightly higher overhead/administrative cost than in CVS; git filter-branch is like svnadmin dump/filter/load; see the sketch below.)
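For completeness, a hedged sketch of purging such a deleted binary with git filter-branch (big-file.bin is a hypothetical path; this rewrites history, so every clone has to re-fetch afterwards):

    # Rewrite all commits on all refs, dropping the blob from the index.
    git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch big-file.bin' \
      --prune-empty -- --all

    # Drop the backup refs and old reflog entries, then repack to
    # actually reclaim the space.
    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive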


As for the speed aspect, I mentioned it in my "How fast is git over subversion with remote operations?" answer (as Linus said in his Google presentation, paraphrasing here: "anything involving the network will just kill performance").

And the GitBenchmark document mentioned by Jakub Narębski is a good addition, even though it doesn't deal directly with Subversion.
It does list the kinds of operations you need to monitor on a DVCS, performance-wise.

Other Git benchmarks are mentioned in this SO question.
