Are Git's pack files deltas rather than snapshots?

Published 2024-10-20 01:55:24


One of the key differences between Git and most other version control systems is that the others tend to store commits as a series of deltas - changesets between one commit and the next. This seems logical, since it's the smallest possible amount of information to store about a commit. But the longer the commit history gets, the more calculation it takes to compare ranges of revisions.

By contrast, Git stores a complete snapshot of the whole project in each revision. The reason this doesn't make the repo size grow dramatically with each commit is each file in the project is stored as a file in the Git subdirectory, named for the hash of its contents. So if the contents haven't changed, the hash hasn't changed, and the commit just points to the same file. And there are other optimizations as well.

All this made sense to me until I stumbled on this information about pack files, into which Git puts data periodically to save space:

In order to save that space, Git
utilizes the packfile. This is a
format where Git will only save the
part that has changed in the second
file, with a pointer to the file it is
similar to.

Isn't this basically going back to storing deltas? If not, how is it different? How does this avoid subjecting Git to the same problems other version control systems have?

For example, Subversion uses deltas, and rolling back 50 versions means undoing 50 diffs, whereas with Git you can just grab the appropriate snapshot. Unless git also stores 50 diffs in the packfiles... is there some mechanism that says "after some small number of deltas, we'll store a whole new snapshot" so that we don't pile up too large a changeset? How else might Git avoid the disadvantages of deltas?

Comments (3)

念﹏祤嫣 2024-10-27 01:55:24


Summary:
Git’s pack files are carefully constructed to effectively use disk caches and
provide “nice” access patterns for common commands and for reading recently referenced
objects.


Git’s pack file
format is quite flexible (see Documentation/technical/pack-format.txt,
or The Packfile in The Git Community Book).
The pack files store objects in two main
ways: “undeltified” (take the raw object data and deflate-compress
it), or “deltified” (form a delta against some other object then
deflate-compress the resulting delta data). The objects stored in
a pack can be in any order (they do not (necessarily) have to be
sorted by object type, object name, or any other attribute) and
deltified objects can be made against any other suitable object of the same type.

Git’s pack-objects command uses several heuristics to
provide excellent locality of reference for common
commands. These heuristics control both the selection of base
objects for deltified objects and the order of the objects. Each
mechanism is mostly independent, but they share some goals.

Git does form long chains of delta compressed objects, but the
heuristics try to make sure that only “old” objects are at the ends of
the long chains. The delta base cache (whose size is controlled by the
core.deltaBaseCacheLimit configuration variable) is automatically
used and can greatly reduce the number of “rebuilds” required for
commands that need to read a large number of objects (e.g. git log -p).
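The effect of the delta base cache can be illustrated with a toy model. This is a sketch under invented names (PackedObject, DeltaBaseCache are not Git's API): reconstructing an object at the end of a delta chain normally means rebuilding every base above it, and caching rebuilt bases makes repeated reads, as in git log -p, cheap.

```python
# Toy model of a delta base cache -- not Git's actual implementation.
class PackedObject:
    def __init__(self, data=None, base=None, delta=None):
        self.data = data    # full data for an undeltified object
        self.base = base    # PackedObject this delta applies to
        self.delta = delta  # callable: base bytes -> object bytes

class DeltaBaseCache:
    def __init__(self):
        self.cache = {}
        self.rebuilds = 0

    def materialize(self, obj):
        """Return the full object data, rebuilding delta chains as needed."""
        if obj.data is not None:
            return obj.data
        key = id(obj)
        if key not in self.cache:
            self.rebuilds += 1
            self.cache[key] = obj.delta(self.materialize(obj.base))
        return self.cache[key]

root = PackedObject(data=b"hello")
child = PackedObject(base=root, delta=lambda b: b + b" world")
grandchild = PackedObject(base=child, delta=lambda b: b + b"!")

cache = DeltaBaseCache()
for _ in range(100):                      # repeated reads, as in `git log -p`
    assert cache.materialize(grandchild) == b"hello world!"
print(cache.rebuilds)                     # prints 2: each link rebuilt once
```

Without the cache, the 100 reads would rebuild the two-link chain 100 times; with it, each link is rebuilt exactly once.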

Delta Compression Heuristic

A typical Git repository stores a very large number of objects, so
it can not reasonably compare them all to find the pairs (and
chains) that will yield the smallest delta representations.

The delta base selection heuristic is based on the idea that the
good delta bases will be found among objects with similar filenames
and sizes. Each type of object is processed separately (i.e. an
object of one type will never be used as the delta base for an
object of another type).

For the purposes of delta base selection, the objects are sorted (primarily) by
filename and then size. A window into this sorted list is used to limit
the number of objects that are considered as potential delta bases.
If a “good enough”1 delta representation is not found for an object
among the objects in its window, then the object will not be delta
compressed.
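The windowed selection described above can be sketched in a few lines. This is a toy illustration only: the object records are made up, the "delta cost" is a trivial stand-in for real delta sizing, and Git's actual heuristic (in builtin/pack-objects.c) is far more refined.

```python
WINDOW = 3          # cf. pack.window (Git's default is 10)
GOOD_ENOUGH = 0.5   # toy threshold: accept a base only if the delta is small

objects = [
    {"name": "xdisp.c", "size": 900, "data": "A" * 900},
    {"name": "xdisp.c", "size": 880, "data": "A" * 870 + "B" * 10},
    {"name": "xterm.c", "size": 500, "data": "C" * 500},
    {"name": "README",  "size": 100, "data": "D" * 100},
]

def delta_cost(base, target):
    # Stand-in for real delta sizing: count positions that differ.
    shared = sum(1 for a, b in zip(base["data"], target["data"]) if a == b)
    return len(target["data"]) - shared

# Sort primarily by filename, then by size (largest first), as described above.
objects.sort(key=lambda o: (o["name"], -o["size"]))

plan = {}
for i, target in enumerate(objects):
    # Only objects inside the window are considered as delta bases.
    candidates = objects[max(0, i - WINDOW):i]
    best = min(candidates, key=lambda b: delta_cost(b, target), default=None)
    if best and delta_cost(best, target) < GOOD_ENOUGH * target["size"]:
        plan[target["name"], target["size"]] = ("delta", best["name"])
    else:
        plan[target["name"], target["size"]] = ("full", None)
```

Here the two versions of xdisp.c end up adjacent after sorting, so the smaller one is stored as a delta against the larger; README and xterm.c find no good base in their windows and are stored whole.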

The size of the window is controlled by the --window= option of
git pack-objects, or the pack.window configuration variable. The
maximum depth of a delta chain is controlled by the --depth=
option of git pack-objects, or the pack.depth configuration
variable. The --aggressive option of git gc greatly enlarges
both the window size and the maximum depth to attempt to create
a smaller pack file.

The filename sort clumps together the objects for entries with
identical names (or at least similar endings (e.g. .c)). The size
sort is from largest to smallest so that deltas that remove data are
preferred to deltas that add data (since removal deltas have shorter
representations) and so that the earlier, larger objects (usually
newer) tend to be represented with plain compression.

1
What qualifies as “good enough” depends on the size of the object in question and its potential delta base as well as how deep its resulting delta chain would be.

Object Ordering Heuristic

Objects are stored in the pack files in a “most recently referenced”
order. The objects needed to reconstruct the most recent history are
placed earlier in the pack and they will be close together. This
usually works well for OS disk caches.

All the commit objects are sorted by commit date (most recent first)
and stored together. This placement and ordering optimizes the disk
accesses needed to walk the history graph and extract basic commit
information (e.g. git log).

The tree and blob objects are stored starting with the tree from the
first stored (most recent) commit. Each tree is processed in a depth
first fashion, storing any objects that have not already been
stored. This puts all the trees and blobs required to reconstruct
the most recent commit together in one place. Any trees and blobs that
have not yet been saved but that are required for later commits are
stored next, in the sorted commit order.
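The ordering above can be sketched as a small function. This is a toy model with invented object tuples, not Git's on-disk structures: commits go first (newest first), then a depth-first walk of each commit's tree emits any tree or blob not already stored.

```python
def pack_order(commits):
    """commits: newest-first list of (commit_id, root_tree).
    A tree is ("tree", id, [children]); a blob is ("blob", id, [])."""
    stored, order = set(), []

    for cid, _ in commits:            # all commits first, newest first
        order.append(cid)

    def walk(obj):
        kind, oid, children = obj
        if oid in stored:             # already placed for a newer commit
            return
        stored.add(oid)
        order.append(oid)
        for child in children:        # depth-first into subtrees and blobs
            walk(child)

    for _, root in commits:           # newest commit's tree walked first
        walk(root)
    return order

# Two commits sharing an unchanged blob (readme-v1):
readme  = ("blob", "readme-v1", [])
main_v2 = ("blob", "main-v2", [])
main_v1 = ("blob", "main-v1", [])
tree2 = ("tree", "T2", [readme, main_v2])   # newest commit's tree
tree1 = ("tree", "T1", [readme, main_v1])

order = pack_order([("C2", tree2), ("C1", tree1)])
# -> ['C2', 'C1', 'T2', 'readme-v1', 'main-v2', 'T1', 'main-v1']
```

Everything needed to reconstruct the newest commit (C2, T2, and its blobs) is clustered at the front; the older commit contributes only the objects not already stored.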

The final object ordering is slightly affected by the delta base selection
in that if an object is selected for delta representation and its base object
has not been stored yet, then its base object is stored immediately before the
deltified object itself. This prevents likely disk cache misses due to the
non-linear access required to read a base object that would have “naturally” been
stored later in the pack file.

≈。彩虹 2024-10-27 01:55:24


The use of delta storage in the pack file is just an implementation detail. At that level, Git doesn't know why or how something changed from one revision to the next, rather it just knows that blob B is pretty similar to blob A except for these changes C. So it will only store blob A and changes C (if it chooses to do so - it could also choose to store blob A and blob B).

When retrieving objects from the pack file, the delta storage is not exposed to the caller. The caller still sees complete blobs. So, Git works the same way it always has without the delta storage optimisation.
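This transparency can be shown with a minimal sketch (the BlobStore API here is invented, not Git's): internally a store may keep blob B as a pointer to A plus the changes C, but readers always receive complete blobs.

```python
class BlobStore:
    def __init__(self):
        # name -> ("full", bytes) | ("delta", base_name, patch_fn)
        self._objects = {}

    def put_full(self, name, data):
        self._objects[name] = ("full", data)

    def put_delta(self, name, base_name, patch_fn):
        self._objects[name] = ("delta", base_name, patch_fn)

    def read(self, name):
        rec = self._objects[name]
        if rec[0] == "full":
            return rec[1]
        _, base_name, patch_fn = rec
        # The delta never escapes read(); callers only see full blobs.
        return patch_fn(self.read(base_name))

store = BlobStore()
store.put_full("A", b"one\ntwo\nthree\n")
# B = A with "three" changed to "four": stored as base + change,
# but read back as a complete blob.
store.put_delta("B", "A", lambda a: a.replace(b"three", b"four"))
assert store.read("B") == b"one\ntwo\nfour\n"
```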

蓝眼睛不忧郁 2024-10-27 01:55:24


As I mentioned in "What are git's thin packs?"

Git does deltification only in packfiles

I detailed the delta encoding used for pack files in "Is the git binary diff algorithm (delta storage) standardized?".
See also "When and how does git use deltas for storage?".

Note that the core.deltaBaseCacheLimit config, which controls the size of the in-memory cache of delta base objects, will soon be bumped from 16MB to 96MB, for Git 2.0.x/2.1 (Q3 2014).

See commit 4874f54 by David Kastrup (May 2014):

Bump core.deltaBaseCacheLimit to 96m

The default of 16m causes serious thrashing for large delta chains combined with large files.

Here are some benchmarks (pu variant of git blame):

time git blame -C src/xdisp.c >/dev/null

for a repository of Emacs repacked with git gc --aggressive (v1.9, resulting in a window size of 250) located on an SSD drive.
The file in question has about 30000 lines, 1Mb of size, and a history with about
2500 commits.

16m (previous default):
  real  3m33.936s
  user  2m15.396s
  sys   1m17.352s

96m:
  real  2m5.668s
  user  1m50.784s
  sys   0m14.288s

This is further optimized with Git 2.29 (Q4 2020), where "git index-pack" learned to resolve deltified objects with greater parallelism.

See commit f08cbf6 (08 Sep 2020), and commit ee6f058, commit b4718ca, commit a7f7e84, commit 46e6fb1, commit fc968e2, commit 009be0d (24 Aug 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit b7e65b5, 22 Sep 2020)

index-pack: make quantum of work smaller

Signed-off-by: Jonathan Tan

Currently, when index-pack resolves deltas, it does not split up delta trees into threads: each delta base root (an object that is not a REF_DELTA or OFS_DELTA) can go into its own thread, but all deltas on that root (direct or indirect) are processed in the same thread.

This is a problem when a repository contains a large text file (thus, delta-able) that is modified many times - delta resolution time during fetching is dominated by processing the deltas corresponding to that text file.

This patch contains a solution to that.
When cloning using

git -c core.deltabasecachelimit=1g clone \
  https://fuchsia.googlesource.com/third_party/vulkan-cts  

on my laptop, clone time improved from 3m2s to 2m5s (using 3 threads, which is the default).

The solution is to have a global work stack. This stack contains delta bases (objects, whether appearing directly in the packfile or generated by delta resolution, that themselves have delta children) that need to be processed; whenever a thread needs work, it peeks at the top of the stack and processes its next unprocessed child. If a thread finds the stack empty, it will look for more delta base roots to push on the stack instead.

The main weakness of having a global work stack is that more time is spent in the mutex, but profiling has shown that most time is spent in the resolution of the deltas themselves, so this shouldn't be an issue in practice. In any case, experimentation (as described in the clone command above) shows that this patch is a net improvement.
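The global-work-stack idea can be sketched in Python (a toy model, not the C implementation in index-pack: Node and the stack protocol are invented for illustration). Worker threads share one stack of delta bases; each thread claims the next unprocessed child under a lock, so one large delta tree no longer pins a single thread.

```python
import threading

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.next_child = 0   # index of the next unprocessed child

lock = threading.Lock()
stack = []        # shared stack of delta bases with pending children
resolved = []

def worker():
    while True:
        with lock:
            # Drop fully dispatched nodes off the top of the stack.
            while stack and stack[-1].next_child >= len(stack[-1].children):
                stack.pop()
            if not stack:
                return
            node = stack[-1]            # peek at the top of the stack
            child = node.children[node.next_child]
            node.next_child += 1        # claim the child
        # ... resolve child's delta here (the expensive part, outside the lock)
        with lock:
            resolved.append(child.name)
            if child.children:          # child has its own deltas
                stack.append(child)

# One base with a long chain of deltas (one large file) plus two small deltas.
chain = Node("d9")
for i in range(8, 0, -1):
    chain = Node(f"d{i}", [chain])
root = Node("base", [chain, Node("other1"), Node("other2")])

stack.append(root)
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```

The chain d1..d9 is still inherently serial, but other1 and other2 can be resolved concurrently by other threads instead of waiting behind the whole tree.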


With Git 2.31 (Q1 2021), you have more details about the format.

See commit 7b77f5a (29 Dec 2020) by Martin Ågren.
(Merged by Junio C Hamano -- gitster -- in commit 16a8055, 15 Jan 2021)

pack-format.txt: document sizes at start of delta data

Reported-by: Ross Light
Signed-off-by: Martin Ågren

We document the delta data as a set of instructions, but forget to document the two sizes that precede those instructions: the size of the base object and the size of the object to be reconstructed.
Fix this omission.

Rather than cramming all the details about the encoding into the running text, introduce a separate section detailing our "size encoding" and refer to it.

technical/pack-format now includes in its man page:

Size encoding

This document uses the following "size encoding" of non-negative
integers: From each byte, the seven least significant bits are
used to form the resulting integer.
As long as the most significant
bit is 1, this process continues; the byte with MSB 0 provides the
last seven bits.

The seven-bit chunks are concatenated. Later values are more significant.

This size encoding should not be confused with the "offset encoding",
which is also used in this document.

technical/pack-format now includes in its man page:

The delta data starts with the size of the base object and the
size of the object to be reconstructed. These sizes are
encoded using the size encoding from above.

The remainder of the delta data is a sequence of instructions to reconstruct the object.
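A sketch of applying such a delta stream, following the copy/insert instruction layout documented in gitformat-pack (simplified: no handling of truncated input beyond assertions). An opcode with MSB 1 copies a range from the base, with the low four bits flagging offset bytes and the next three flagging size bytes; an opcode with MSB 0 inserts that many literal bytes.

```python
def read_varint(buf, i):
    value = shift = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return value, i

def apply_delta(base, delta):
    base_size, i = read_varint(delta, 0)
    result_size, i = read_varint(delta, i)
    assert base_size == len(base)
    out = bytearray()
    while i < len(delta):
        op = delta[i]; i += 1
        if op & 0x80:                       # copy from base
            offset = size = 0
            for bit in range(4):            # up to 4 offset bytes
                if op & (1 << bit):
                    offset |= delta[i] << (8 * bit); i += 1
            for bit in range(3):            # up to 3 size bytes
                if op & (0x10 << bit):
                    size |= delta[i] << (8 * bit); i += 1
            if size == 0:
                size = 0x10000              # documented special case
            out += base[offset:offset + size]
        elif op:                            # insert literal: op = length
            out += delta[i:i + op]; i += op
        else:
            raise ValueError("reserved opcode 0")
    assert len(out) == result_size
    return bytes(out)

# sizes 12 and 13; copy 7 bytes from offset 0; insert the 6 bytes "packed"
delta = b"\x0c\x0d" + b"\x90\x07" + b"\x06packed"
assert apply_delta(b"hello, world", delta) == b"hello, packed"
```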
