What do the numbers in the "Total" line output by git gc / git repack mean?
When I run git gc or git repack over my Git repository, it outputs a "Total" line once it's done. What do these numbers mean?
A couple of examples from a fairly small repository:
$ git gc
...
Total 576 (delta 315), reused 576 (delta 315)
$ git repack -afd --depth=250 --window=250
...
Total 576 (delta 334), reused 242 (delta 0)
And one from a much larger repository:
$ git gc
...
Total 347629 (delta 289610), reused 342219 (delta 285060)
...
I can guess what that first "Total" number is: the number of Git objects (so commits, trees and files) in the repository. What do all the others actually mean?
I've already looked at the git-gc(1) and git-repack(1) man pages, and perused their "See also"s too, and my attempts at Googling have only produced irrelevant results.
Comments (1)
I did some work with dulwich, a pure-Python implementation of Git. What I am about to say here reflects my experience with dulwich's git implementation, not the canonical git source, so there may be differences.
Git is remarkably simple - I mean, so simple it confounds! The name is really appropriate to its design which is very clever due to its stupidity.
When you commit anything, git takes what's in the index (staging area) and creates objects named by their SHA digest: each file's contents are hashed and stored as a blob object, each directory is hashed and stored as a tree object listing its blobs and subtrees, and all of that gets bound into a commit object, which also has a SHA. Git just fires these straight into the file system under .git/objects as it processes the commit. If it succeeds at firing all of them in there, it simply writes the SHA of the most recent commit object into .git/refs/heads/.
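To make the "everything gets SHAed" step concrete, here is a minimal sketch in plain Python of how a loose object gets its name and its place on disk. It assumes the default SHA-1 object format and the usual "kind, space, length, NUL byte" header; the function names are my own and are not part of dulwich or git.

```python
import hashlib
import zlib


def object_header(body: bytes, kind: str = "blob") -> bytes:
    """Build the '<kind> <length>' header, terminated by a NUL byte."""
    return f"{kind} {len(body)}".encode("ascii") + b"\x00"


def git_object_id(body: bytes, kind: str = "blob") -> str:
    """Hex SHA-1 that names the object (assumes the SHA-1 object format)."""
    return hashlib.sha1(object_header(body, kind) + body).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Loose objects are fanned out by the first two hex digits of the ID."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(body: bytes, kind: str = "blob") -> bytes:
    """Bytes stored on disk: header plus body, zlib-compressed."""
    return zlib.compress(object_header(body, kind) + body)


empty_id = git_object_id(b"")
print(empty_id, loose_object_path(empty_id))
```

Trees and commits are named the same way, only with a different kind in the header and a body that lists other object IDs.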
From time to time a commit may fail halfway through. If something fails to write into .git/objects, git does no cleanup at that time. That's because usually you'll fix the problem and redo the commit. In that case git picks up exactly where it previously halted, i.e. halfway through the commit.
Here's where git gc comes in. It simply walks through the objects in .git/objects, marking off all those which are reachable in some way from a ref: HEAD, a branch or a tag (canonical git also counts reflog entries). Anything remaining is orphaned and has nothing to do with anything "important", so it can be deleted. This is why if you branch, do some work on that branch, but later abandon that branch and delete any reference to it from your git repo, the periodic git gc will eventually purge that work entirely. Canonical git does wait until the reflog entries have expired and the unreachable objects are older than a grace period, two weeks by default. This can surprise some older VCS users, e.g. CVS never forgot anything except when it crashed or corrupted itself (which was often).
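The marking pass described above is an ordinary mark-and-sweep over the object graph. Below is a toy sketch: the object names, the dictionary layout and the single branch ref are all made up for illustration, and real git additionally honours the reflog and a prune grace period.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical object store: object ID -> IDs of the objects it refers to.
OBJECTS: Dict[str, List[str]] = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree1": ["blob_a"],
    "tree2": ["blob_a", "blob_b"],
    "blob_a": [],
    "blob_b": [],
    # Work from an abandoned branch whose ref has been deleted.
    "commit_x": ["commit1", "tree_x"],
    "tree_x": ["blob_x"],
    "blob_x": [],
}

# Hypothetical refs: only one branch is left.
REFS: Dict[str, str] = {"refs/heads/master": "commit2"}


def reachable(objects: Dict[str, List[str]], roots: Iterable[str]) -> Set[str]:
    """Mark phase: collect every object reachable from the given roots."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in objects:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


def sweep(objects: Dict[str, List[str]], roots: Iterable[str]) -> Dict[str, List[str]]:
    """Sweep phase: keep only the objects that were marked."""
    live = reachable(objects, roots)
    return {oid: refs for oid, refs in objects.items() if oid in live}


kept = sweep(OBJECTS, REFS.values())
print(sorted(kept))
```

The three objects that only the deleted branch referred to are dropped, while commit1 survives because the remaining branch still reaches it.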
git repack (really git-pack-objects) is totally different to git gc (as in, a separate command and operation, though git gc may call git repack). As I mentioned earlier, git just fires every object into its own file named by its SHA. It does compress each one with zlib before it goes to disk, but obviously this isn't space efficient over the long run. So what git-pack-objects does is examine a series of objects for anywhere that data replicates across revisions. It doesn't care what kind of object it is; all are considered equal for packing. It then generates binary deltas where those make sense, and stores the entire lot as a .pack file in .git/objects/pack, removing any packed objects from the normal directory structure.
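The idea of a binary delta can be sketched with nothing but the standard library. This is not git's actual delta encoding or its base-selection heuristics; it is a toy copy/insert scheme built on difflib, with made-up function names, that shows why storing a revision as "copy these ranges from the base, insert these few bytes" is so much smaller than storing it whole.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# A delta is a list of ("copy", start, end) and ("insert", data) operations.
Op = Tuple


def make_delta(base: bytes, target: bytes) -> List[Op]:
    """Describe target as copies from base plus literal inserts."""
    ops: List[Op] = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: those base bytes are simply not copied.
    return ops


def apply_delta(base: bytes, ops: List[Op]) -> bytes:
    """Rebuild the target from the base and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old = b"def main():\n    print('hello')\n    return 0\n"
new = b"def main():\n    print('hello, world')\n    return 0\n"
delta = make_delta(old, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(new), "bytes in the new revision,", literal_bytes, "of them stored literally")
```

Only the inserted bytes have to be stored alongside the copy instructions; everything else is recovered from the base object when the delta is applied.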
Note that generally git-pack-objects makes a new .pack file rather than replacing existing .pack files, if the most recent pack file is less than 1 MB in size. Thus, over time you'll see multiple .pack files appear in .git/objects/pack. Indeed, when you git fetch, you are essentially asking the remote repo to pack up the objects that the fetching repo doesn't have and to send that pack across. git repack simply calls git-pack-objects but tells it to merge .pack files as it sees fit. That implies decompressing anything which has changed, regenerating the binary deltas and recompressing.
So, to answer your question: the Total figure is the number of objects written into the pack, which after a git gc or a full repack is effectively every object in the git repo. The first delta number is how many of those objects are stored as binary deltas, i.e. how many objects git has decided have a strong similarity with other objects and can be stored as a difference against them. The reused number indicates how many objects were copied across from a compressed source (i.e. an existing packfile) as they stood, without being recompressed, and the delta number in its brackets is how many of those reused objects were existing deltas taken over without being recomputed. That explains your second example: -f tells git repack not to reuse existing deltas, so all 334 deltas were computed afresh (hence reused delta 0), while the 242 non-delta objects were still copied over unchanged. Reuse typically occurs when you have multiple packfiles and the older ones already hold perfectly good compressed data; it lets git make use of previously compressed older revisions without having to recompress them to include more recent additions. Note that git may append to an existing pack file without rewriting the entire pack file.
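If you want to pull the four numbers out of the output mechanically, a small parser is enough. This is a sketch under assumptions: the field names are my own labels for the four positions, and the regular expression only targets the line format shown in the question (anything git prints after the fourth number is ignored).

```python
import re
from typing import NamedTuple, Optional


class PackTotals(NamedTuple):
    total: int         # objects written to the pack
    delta: int         # of those, how many are stored as deltas
    reused: int        # objects copied from an existing pack
    reused_delta: int  # of the reused ones, how many are deltas


_TOTAL_RE = re.compile(
    r"Total (\d+) \(delta (\d+)\), reused (\d+) \(delta (\d+)\)"
)


def parse_total_line(line: str) -> Optional[PackTotals]:
    """Return the four counters, or None if the line is not a Total line."""
    match = _TOTAL_RE.search(line)
    if match is None:
        return None
    return PackTotals(*(int(group) for group in match.groups()))


small = parse_total_line("Total 576 (delta 334), reused 242 (delta 0)")
large = parse_total_line(
    "Total 347629 (delta 289610), reused 342219 (delta 285060)"
)
print(small)
print(large.total - large.reused, "objects were not reused in the large repo")
```

For the second example in the question, total minus delta is 576 - 334 = 242, which is exactly the reused count.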
Generally speaking, a high reused count just means that most of the pack was carried over from earlier packing runs, so some space might be reclaimed by a full repack that recomputes the deltas (i.e. git repack -a -f, as in your second example, which returns the reused delta count to zero). A plain git repack -a still reuses existing pack data wherever it can. However, generally git will silently take care of all of that for you. Also, doing full repacks may force some git fetches to restart from scratch because the packs differ. This depends on server settings (allowing custom per-client pack generation is expensive on server CPU, so some major Git hosting sites disable it).
Hopefully this answers your question. Really, git is so simple that you're amazed it works at all in the beginning; then, as you wrap your head around it, you become seriously impressed. Only truly genius programmers can write something so simple that works so well, because they can see simplicity where most programmers can only see complexity.
Niall