如何真正理解"Git 保存的是文件快照"这一句话?

快照

git

绝活

32 条回复 • 2014-07-04 19:57:41 +08:00

1

chloerei

2014 年 7 月 4 日

就是每次修改都是整个文件保存。

2

chloerei

2014 年 7 月 4 日

2

于此对比，有些版本管理工具是保存每个版本之间的变化，这样虽然总文件体积小，但是每检出一个文件都要从最开始的版本一个个修改叠加上去，很慢。

3

kzing

OP

2014 年 7 月 4 日

@chloerei 这么说的话, 那就算是一个非常非常小的改动, 提交会也会有一个新的完整的文件诞生咯? 那这样, 随着版本的增多, 或者对于文件特别大的项目, Git 不会很吃力吗?

4

chloerei

2014 年 7 月 4 日

@kzinglzy 是的，如果项目很大，git 库会更大。git 设计出来是管理 linux 内核的，很多巨型项目（例如 Android）也是用 git，跟着走应该没问题。

Facebook's git repo is 54GB https://news.ycombinator.com/item?id=7648237

另外 git clone 的时候可以指定拷贝深度，减少拷贝体积。

5

jsonline

2014 年 7 月 4 日 via Android

@kzinglzy 本地复制文件能有多吃力。分析差异才吃力。

6

xujialiang

2014 年 7 月 4 日

貌似错了吧？

7

blacktulip

2014 年 7 月 4 日

1

Smaller Space Requirements

Git's repository and working directory sizes are extremely small when compared to SVN.

For example the Mozilla repository is reported to be almost 12 Gb when stored in SVN using the fsfs backend. Previously, the fsfs backend also required over 240,000 files in one directory to record all 240,000 commits made over the 10 year project history. This was fixed in SVN 1.5, where every 1000 revisions are placed in a separate directory. The exact same history is stored in Git by only two files totaling just over 420 Mb. This means that SVN requires 30x the disk space to store the same history.

One of the reasons for the smaller repo size is that an SVN working directory always contains two copies of each file: one for the user to actually work with and another hidden in .svn/ to aid operations such as status, diff and commit. In contrast a Git working directory requires only one small index file that stores about 100 bytes of data per tracked file. On projects with a large number of files this can be a substantial difference in the disk space required per working copy.

As a full Git clone is often smaller than a full checkout, Git working directories (including the repositories) are typically smaller than the corresponding SVN working directories. There are even ways in Git to share one repository across many working directories, but in contrast to SVN, this requires the working directories to be colocated.

8

dorentus

2014 年 7 月 4 日

http://stackoverflow.com/a/8198276/90172

9

dorentus

2014 年 7 月 4 日

@dorentus 这段：

While that's true and important on the conceptual level, it is NOT true at the storage level. Git does use deltas for storage. Not only that, but it's more efficient in it than any other system. Because it does not keep per-file history, when it wants to do delta-compression, it takes each blob, selects some blobs that are likely to be similar (using heuristics that includes the closest approximation of previous version and some others), tries to generate the deltas and picks the smallest one. – Jan Hudec Jan 3 '13 at 13:51

10

est

2014 年 7 月 4 日

@chloerei 就是不能指定子路径clone。。。。

11

akfish

2014 年 7 月 4 日

要真正理解的话，了解一下Git内部实现，并且亲自从Git最底层的命令撸一次常见命令：
http://git-scm.com/book/en/Git-Internals

简单的说，Git把文件的每一个版本存储为blob文件，并不做diff，只有在pack的时候，才会有算法按一定的策略计算delta。

12

akfish

2014 年 7 月 4 日

13

akfish

2014 年 7 月 4 日

Git唯一会计算差异的地方就是pack，关于pack算法，参见：
https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt

14

dorentus

2014 年 7 月 4 日

@dorentus 这评论里面说的应该是 git-gc 所做的事情吧。

按我的理解，git gc 压缩完之后，历史 commit 就不一定每个都是完整的快照了。然后由评论里所说，由于 git 保存的不是每个文件的差异，而就是一个个文件，它压缩的时候可以方便地找出相似的文件来一起压缩，不少情况下反而会更省空间。

15

generic

2014 年 7 月 4 日

1

@chloerei 你的理解是错误的。

git的底层(plumbing layer)*接口*处理的单位是整个文件（以及整个目录树等等），这并不意味着底层*实现*直接在磁盘上存快照而不使用delta压缩。

进入任何一个git项目的.git/objects目录，那些以两位十六进制数命名的子目录里保存的就是未压缩的原始对象（文件快照），但这只是最近创建的还没来得及压缩的一小部分对象。在pack目录中你会看见以压缩形态存储的大部分对象。

通过隐藏“文件的版本之间的delta”这一细节而只暴露“文件快照”这一概念，git底层反而实现了更高的存储效率。因为delta不再局限于一个文件的相邻版本之间。如果你的项目里有两个文件内容相同，git只会存储一份对象；如果两个文件内容相似，git也可以对它们作delta压缩。这都是暴露了delta机制的系统，如svn, bzr没有做到的。

16

chloerei

2014 年 7 月 4 日

@generic 提问，如果由 A 改为 A'，两个文件不同了，git 储存了两份文件，这个有误吗？我知道文件如果没变化，就不会产生新的文件。

17

generic

2014 年 7 月 4 日

1

@chloerei 取决于你把说呢么叫做“储存了两份文件”。

概念上，git底层（实际上是一个CAS: content-addressed storage）存储了两个对象。

实际上，在你磁盘文件系统上则未必有两份文件。在你刚刚commit的时候A'很可能以独立的文件存在；在你手动运行git repack或者由git自动运行之后，则以pack的形式压缩在一起。

18

dorentus

2014 年 7 月 4 日

看来我把 git repack 和 git gc 搞混了..

19

chloerei

2014 年 7 月 4 日

@generic ……两个数据块。因为直觉上 A + A' 的体积要比 A + ' 大，还是说 pack 压缩后并不占用 A + A' 的体积，相当于 A + '？

20

generic

2014 年 7 月 4 日

1

@dorentus git gc除了删除垃圾文件，也会自动调用git repack，所以你没搞混。

21

c742435

2014 年 7 月 4 日

从编程复杂度来讲，反而是保存差异更高级呢
不过 git多用于代码，而且现在磁盘容量也够大了，无所谓了

22

c742435

2014 年 7 月 4 日

git底层是zip吗

23

dorentus

2014 年 7 月 4 日

@c742435 保存差异或许实现起来更麻烦，但不能算高级吧……

24

akfish

2014 年 7 月 4 日

@generic

git的底层(plumbing layer)*接口*处理的单位是整个文件（以及整个目录树等等），这并不意味着底层*实现*直接在磁盘上存快照而不使用delta压缩。
--------------------------------------------------------------------

底层实现的确是直接在磁盘上存快照，pack不是必须，并且总是离线发生。而存blob总会立即发生，并且总是先于pack发生。在任何情况下，git总是会先**直接**存储两份物理意义上的文件。

物理文件和概念文件是否一一对应并不值得纠结，就好比讨论Linux文件系统的文件时，你不会纠结inode在哪个扇区、是否和文件一一对应。

真正决定Git本质的，就是概念上的存文件快照而不存差异。

25

jokester

2014 年 7 月 4 日

@c742435 git object是用zlib压缩的

26

yxz00

2014 年 7 月 4 日

@generic 理论上保存2份拷贝压缩不见得比保存delta再压缩小吧。除非svn类的软件从来不对自己的repo进行压缩，否则git的保存方式应该比svn大才对。

27

dorentus

2014 年 7 月 4 日

@yxz00 git pack 不是以同一个文件为单位压缩的，即使是来源于不同文件的 object，只要 git 觉得它们足够相似，都会一起压缩。

然后 git 一般比 svn 占用空间小的原因其实是因为 svn tag/branch 都其实是复制出去一份副本的，git 不是。

28

dorentus

2014 年 7 月 4 日

1

另外对于楼主后一个问题
“就算是一个非常非常小的改动, 提交会也会有一个新的完整的文件诞生咯? 那这样, 随着版本的增多, 或者对于文件特别大的项目, Git 不会很吃力吗?”
其实也就是文件数量会比较多或者文件大小比较大吧：文件数量会比较多的问题可以由 git repack 解决；文件大小比较大的仅限于放到 repo 里面的单个文件大小比较大的情况，也就是会影响 git diff 的性能吧。

29

akfish

2014 年 7 月 4 日

@yxz00 压缩后文件大小只是一方面，还有一个很重要的点就是存取效率、传输效率。
一堆分离的delta小文件，比起良好的pack成一坨的大文件，IO pattern要差得多
Git pack的算法专门就有策略保证pack出来的文件locality良好。

另外一点就是传输的时候，Git可以先双方协商，看对方缺什么文件，有什么文件，然后总是在对方已有文件的基础上，寻找最优的pack策略，在有的时候就能获得比全局diff更低的带宽占用。

30

yxz00

2014 年 7 月 4 日

@dorentus 直觉上git不会用这么不“git”的方式去做压缩。这个”足够相似“太玄乎了。看来还是得自己读代码看看。

31

akfish

2014 年 7 月 4 日

2

Git的pack策略：

<gitster> The quote from the above linus should be rewritten a
bit (wait for it):
- first sort by type. Different objects never delta with
each other.
- then sort by filename/dirname. hash of the basename
occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
DIR_BITS are for the hash of leading path elements.
- then if we are doing "thin" pack, the objects we are _not_
going to pack but we know about are sorted earlier than
other objects.
- and finally sort by size, larger to smaller.

<gitster> That's the sort order. What this means is:
- we do not delta different object types.
- we prefer to delta the objects with the same full path, but
allow files with the same name from different directories.
- we always prefer to delta against objects we are not going
to send, if there are some.
- we prefer to delta against larger objects, so that we have
lots of removals.

The penultimate rule is for "thin" packs. It is used when
the other side is known to have such objects.

32

generic

2014 年 7 月 4 日

1

@chloerei 这涉及pack的很多实现细节，特别是pack本身还对文件作压缩，所以并不是你的两个选项那么简单。

不过我们也可以作个小实验：

~/tmp$ ls
~/tmp$ mkdir test
~/tmp$ cd test
~/tmp/test$ git init
初始化空的 Git 版本库于 /home/jin/tmp/test/.git/
~/tmp/test (master #%)$ dd if=/dev/urandom of=a bs=1 count=1000000
记录了1000000+0 的读入
记录了1000000+0 的写出
1000000字节(1.0 MB)已复制，2.53473 秒，395 kB/秒
~/tmp/test (master #%)$ git add a
~/tmp/test (master #)$ git commit -m 'initial commit'
[master（根提交） 710fdd3] initial commit
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 a
~/tmp/test (master)$ du -csh .git
1.2M .git
1.2M 总用量
~/tmp/test (master)$ dd if=/dev/urandom of=a bs=1 count=500000 oflag=append conv=notrunc
记录了500000+0 的读入
记录了500000+0 的写出
500000字节(500 kB)已复制，1.26773 秒，394 kB/秒
~/tmp/test (master *)$ ls -l a
-rw-r--r-- 1 jin users 1500000 7月 4 18:06 a
~/tmp/test (master *)$ git add a
~/tmp/test (master +)$ git commit -m 'A dash'
[master d47c454] A dash
1 file changed, 0 insertions(+), 0 deletions(-)
~/tmp/test (master)$ du -csh .git
2.6M .git
2.6M 总用量
~/tmp/test (master)$ git gc
对象计数中: 6, 完成.
Delta compression using up to 4 threads.
压缩对象中: 100% (4/4), 完成.
写入对象中: 100% (6/6), 完成.
Total 6 (delta 1), reused 0 (delta 0)
~/tmp/test (master)$ du -csh .git
1.6M .git
1.6M 总用量