One of the key differences between Git and most other version control systems is that the others tend to store commits as a series of deltas - changesets between one commit and the next. This seems logical, since it's the smallest possible amount of information to store about a commit. But the longer the commit history gets, the more calculation it takes to compare ranges of revisions.
By contrast, Git stores a complete snapshot of the whole project in each revision. The reason this doesn't make the repo size grow dramatically with each commit is each file in the project is stored as a file in the Git subdirectory, named for the hash of its contents. So if the contents haven't changed, the hash hasn't changed, and the commit just points to the same file. And there are other optimizations as well.
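To make the "named for the hash of its contents" point concrete, here is a minimal sketch of how a blob's object name is derived in a repository that uses the default SHA-1 object format. The function name git_blob_id is made up for this illustration; it is not part of any Git API.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object name for a blob holding `content`.

    Git hashes a short header ("blob <size>" followed by a NUL byte)
    plus the raw content, so the name depends only on the bytes stored.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# A file that is unchanged between two commits hashes to the same name,
# so both snapshots can point at a single stored object.
unchanged_in_commit_1 = git_blob_id(b"hello world\n")
unchanged_in_commit_2 = git_blob_id(b"hello world\n")
edited = git_blob_id(b"hello, world\n")

print(unchanged_in_commit_1 == unchanged_in_commit_2)  # same object
print(unchanged_in_commit_1 == edited)                 # different object
```

Because the name is a pure function of the content, a new commit only adds objects for files (and the trees containing them) that actually changed.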
All this made sense to me until I stumbled on this information about pack files, into which Git puts data periodically to save space:
"In order to save that space, Git utilizes the packfile. This is a format where Git will only save the part that has changed in the second file, with a pointer to the file it is similar to."
Isn't this basically going back to storing deltas? If not, how is it different? How does this avoid subjecting Git to the same problems other version control systems have?
For example, Subversion uses deltas, and rolling back 50 versions means undoing 50 diffs, whereas with Git you can just grab the appropriate snapshot. Unless git also stores 50 diffs in the packfiles... is there some mechanism that says "after some small number of deltas, we'll store a whole new snapshot" so that we don't pile up too large a changeset? How else might Git avoid the disadvantages of deltas?
Summary:
Git’s pack files are carefully constructed to effectively use disk caches and
provide “nice” access patterns for common commands and for reading recently referenced
objects.
Git’s pack file
format is quite flexible (see Documentation/technical/pack-format.txt,
or The Packfile in The Git Community Book).
The pack files store objects in two main
ways: “undeltified” (take the raw object data and deflate-compress
it), or “deltified” (form a delta against some other object then
deflate-compress the resulting delta data). The objects stored in
a pack can be in any order (they do not (necessarily) have to be
sorted by object type, object name, or any other attribute) and
deltified objects can be made against any other suitable object of the same type.
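The two storage forms can be sketched as follows. This is only an illustration under loud assumptions: the copy/insert instructions below are a toy delta built with Python's difflib, not Git's binary delta encoding, and the function names (make_delta, apply_delta, store_undeltified, store_deltified, load_deltified) are invented for the example.

```python
import ast
import zlib
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Build a toy copy/insert delta that turns `base` into `target`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes from the base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal new bytes
        # "delete" needs no instruction: those base bytes are never copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target object from its base and the toy delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def store_undeltified(data: bytes) -> bytes:
    """'Undeltified': deflate-compress the raw object data."""
    return zlib.compress(data)


def store_deltified(base: bytes, data: bytes) -> bytes:
    """'Deltified': form a delta against a base, then deflate-compress it."""
    return zlib.compress(repr(make_delta(base, data)).encode("ascii"))


def load_deltified(base: bytes, stored: bytes) -> bytes:
    ops = ast.literal_eval(zlib.decompress(stored).decode("ascii"))
    return apply_delta(base, ops)


base = b"".join(b"line %d of the file\n" % i for i in range(50))
target = base + b"one more line\n"

plain = store_undeltified(target)
delta = store_deltified(base, target)
print(len(plain), len(delta))
print(load_deltified(base, delta) == target)
```

Either way the reader gets the same bytes back; the deltified form simply needs its base object to be available first.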
Git’s pack-objects command uses several heuristics to
provide excellent locality of reference for common
commands. These heuristics control both the selection of base
objects for deltified objects and the order of the objects. Each
mechanism is mostly independent, but they share some goals.
Git does form long chains of delta compressed objects, but the
heuristics try to make sure that only “old” objects are at the ends of
the long chains. The delta base cache (whose size is controlled by the
core.deltaBaseCacheLimit configuration variable) is automatically
used and can greatly reduce the number of “rebuilds” required for
commands that need to read a large number of objects (e.g. git log -p).
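A rough sketch of why a base cache helps when walking history. Everything here is hypothetical: PACK is an in-memory stand-in for a pack file, the "delta" is a toy that keeps only a prefix of its base, and the cache is an unbounded dict, whereas Git's real cache is limited by core.deltaBaseCacheLimit.

```python
# Newest version is stored in full; older versions are deltas against newer ones.
# Toy delta: ("delta", base_name, keep) means "the first `keep` bytes of the base".
PACK = {
    "v4": ("full", b"a\nb\nc\nd\n"),
    "v3": ("delta", "v4", 6),
    "v2": ("delta", "v3", 4),
    "v1": ("delta", "v2", 2),
}


def read_object(name, cache, stats):
    """Return the complete object, applying deltas along the chain as needed."""
    if cache is not None and name in cache:
        return cache[name]
    entry = PACK[name]
    if entry[0] == "full":
        data = entry[1]
    else:
        _, base_name, keep = entry
        base = read_object(base_name, cache, stats)
        data = base[:keep]
        stats["applied"] += 1  # one delta application ("rebuild" step)
    if cache is not None:
        cache[name] = data
    return data


def walk(names, use_cache):
    """Read objects in the given order and count delta applications."""
    cache = {} if use_cache else None
    stats = {"applied": 0}
    contents = [read_object(n, cache, stats) for n in names]
    return contents, stats["applied"]


newest_first = ["v4", "v3", "v2", "v1"]
without_cache, cost_without = walk(newest_first, use_cache=False)
with_cache, cost_with = walk(newest_first, use_cache=True)
print(cost_without, cost_with)  # 6 delta applications without the cache, 3 with it
```

In this toy the most recent version costs nothing to read, and the cost of a long chain is only paid when reaching far back into old history.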
Delta Compression Heuristic
A typical Git repository stores a very large number of objects, so
it cannot reasonably compare them all to find the pairs (and
chains) that will yield the smallest delta representations.
The delta base selection heuristic is based on the idea that
good delta bases will be found among objects with similar filenames
and sizes. Each type of object is processed separately (i.e. an
object of one type will never be used as the delta base for an
object of another type).
For the purposes of delta base selection, the objects are sorted (primarily) by
filename and then size. A window into this sorted list is used to limit
the number of objects that are considered as potential delta bases.
If a “good enough” [1] delta representation is not found for an object
among the objects in its window, then the object will not be delta
compressed.
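The windowed search described above might be sketched like this. It is a simplification under stated assumptions: the sort key (basename, then size descending), the delta_cost estimate, and the "less than half the object size" threshold are all stand-ins chosen for the example, not the values or hashing that git pack-objects actually uses.

```python
import os
from difflib import SequenceMatcher


def delta_cost(base: bytes, target: bytes) -> int:
    """Estimate a toy delta's size: literal bytes plus a fixed cost per copy."""
    cost = 0
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            cost += 4                # hypothetical size of a copy instruction
        elif tag in ("replace", "insert"):
            cost += 1 + (j2 - j1)    # length byte plus the literal data
    return cost


def choose_bases(objects, window=10, max_depth=50):
    """objects: {object_id: (path, data)}. Returns {object_id: base_id or None}."""
    order = sorted(
        objects,
        key=lambda oid: (os.path.basename(objects[oid][0]), -len(objects[oid][1])),
    )
    base_of, depth = {}, {}
    for idx, oid in enumerate(order):
        data = objects[oid][1]
        best, best_cost = None, len(data) // 2   # assumed "good enough" bar
        for cand in order[max(0, idx - window):idx]:
            if depth[cand] + 1 > max_depth:
                continue                          # chain would get too deep
            cost = delta_cost(objects[cand][1], data)
            if cost < best_cost:
                best, best_cost = cand, cost
        base_of[oid] = best
        depth[oid] = depth[best] + 1 if best is not None else 0
    return base_of


main_v1 = b"#include <stdio.h>\nint main(void) {\n    return 0;\n}\n"
main_v2 = main_v1 + b"/* trailing comment added later */\n"
objects = {
    "readme": ("README", b"A completely different README text.\n"),
    "main_v1": ("src/main.c", main_v1),
    "main_v2": ("src/main.c", main_v2),
}
bases = choose_bases(objects, window=2)
print(bases)
```

With this input the larger, newer main_v2 finds no good base and stays plainly compressed, while the smaller main_v1 becomes a removal delta against it.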
The size of the window is controlled by the --window= option of
git pack-objects, or the pack.window configuration variable. The
maximum depth of a delta chain is controlled by the --depth= option of
git pack-objects, or the pack.depth configuration variable. The
--aggressive option of git gc greatly enlarges both the window size and
the maximum depth to attempt to create a smaller pack file.
The filename sort clumps together the objects for entries with
identical names (or at least similar endings (e.g. .c)). The size
sort is from largest to smallest so that deltas that remove data are
preferred to deltas that add data (since removal deltas have shorter
representations) and so that the earlier, larger objects (usually
newer) tend to be represented with plain compression.
[1] What qualifies as “good enough” depends on the size of the object in question and its potential delta base, as well as how deep its resulting delta chain would be.
Object Ordering Heuristic
Objects are stored in the pack files in a “most recently referenced”
order. The objects needed to reconstruct the most recent history are
placed earlier in the pack and they will be close together. This
usually works well for OS disk caches.
All the commit objects are sorted by commit date (most recent first)
and stored together. This placement and ordering optimizes the disk
accesses needed to walk the history graph and extract basic commit
information (e.g. git log).
The tree and blob objects are stored starting with the tree from the
first stored (most recent) commit. Each tree is processed in a
depth-first fashion, storing any objects that have not already been
stored. This puts all the trees and blobs required to reconstruct
the most recent commit together in one place. Any trees and blobs that
have not yet been saved but that are required for later commits are
stored next, in the sorted commit order.
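The ordering just described can be sketched on a tiny object graph. The data layout (commit tuples, a trees dict, blobs as leaf ids) and the function name pack_order are assumptions made for the illustration, and the sketch ignores the delta-base adjustment to the final order.

```python
def pack_order(commits, trees):
    """Return object ids in a 'most recently referenced first' order.

    commits: list of (commit_id, commit_date, root_tree_id)
    trees:   {tree_id: [child ids]}; ids that are not keys of `trees` are blobs
    """
    ordered, seen = [], set()

    # 1. All commits together, most recent first.
    by_date = sorted(commits, key=lambda c: c[1], reverse=True)
    for commit_id, _, _ in by_date:
        ordered.append(commit_id)
        seen.add(commit_id)

    # 2. Trees and blobs, depth first, starting from the newest commit's tree
    #    and skipping anything that has already been stored.
    def visit(obj_id):
        if obj_id in seen:
            return
        seen.add(obj_id)
        ordered.append(obj_id)
        for child in trees.get(obj_id, []):
            visit(child)

    for _, _, root_tree in by_date:
        visit(root_tree)
    return ordered


commits = [("c1", 100, "t1"), ("c2", 200, "t2")]
trees = {
    "t1": ["b_readme", "b_main_v1"],
    "t2": ["b_readme", "b_main_v2"],
}
order = pack_order(commits, trees)
print(order)
# ['c2', 'c1', 't2', 'b_readme', 'b_main_v2', 't1', 'b_main_v1']
```

Everything needed to check out the newest commit (t2 and its blobs) sits in one contiguous run, and the older commit only contributes the objects it does not share.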
The final object ordering is slightly affected by the delta base selection
in that if an object is selected for delta representation and its base object
has not been stored yet, then its base object is stored immediately before the
deltified object itself. This prevents likely disk cache misses due to the
non-linear access required to read a base object that would have “naturally” been
stored later in the pack file.
The use of delta storage in the pack file is just an implementation detail. At that level, Git doesn't know why or how something changed from one revision to the next, rather it just knows that blob B is pretty similar to blob A except for these changes C. So it will only store blob A and changes C (if it chooses to do so - it could also choose to store blob A and blob B).
When retrieving objects from the pack file, the delta storage is not exposed to the caller. The caller still sees complete blobs. So, Git works the same way it always has without the delta storage optimisation.
As I mentioned in "What are git's thin packs?"
I detailed the delta encoding used for pack files in "Is the git binary diff algorithm (delta storage) standardized?".
See also "When and how does git use deltas for storage?".
Note that the core.deltaBaseCacheLimit config, which controls the size of the delta base cache, has had its default bumped from 16MB to 96MB for Git 2.0.x/2.1 (Q3 2014).
See commit 4874f54 by David Kastrup (May 2014):
Bump core.deltaBaseCacheLimit to 96m
This is further optimized with Git 2.29 (Q4 2020), where "git index-pack" learned to resolve deltified objects with greater parallelism.
See commit f08cbf6 (08 Sep 2020), and commit ee6f058, commit b4718ca, commit a7f7e84, commit 46e6fb1, commit fc968e2, commit 009be0d (24 Aug 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit b7e65b5, 22 Sep 2020)
With Git 2.31 (Q1 2021), you have more details about the format.
See commit 7b77f5a (29 Dec 2020) by Martin Ågren.
(Merged by Junio C Hamano -- gitster -- in commit 16a8055, 15 Jan 2021)
technical/pack-format now includes in its man page: