Version control for large binary files and >1TB repositories?
Sorry to bring up this topic again, as there are so many related questions already - but none that covers my problem directly.
What I'm searching for is a good version control system that can handle just two simple requirements:
- store large binary files (>1GB)
- support a repository that's >1TB (yes, that's TB)
Why? We're in the process of repackaging a few thousand software applications for our next big OS deployment and we want those packages to follow version control.
So far I've got some experience with SVN and CVS, but I'm not quite satisfied with the performance of either when it comes to large binary files (a few MSI or CAB files will be >1GB). Also, I'm not sure they'll scale well to the amount of data we're expecting over the next 2-5 years (like I said, an estimated >1TB).
So, do you have any recommendations?
I'm currently also looking into SVN externals as well as Git submodules, though that would mean a separate repository for each software package, and I'm not sure that's what we want.
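For illustration, the submodule approach would mean a top-level "catalog" repository referencing one repository per package, roughly like this (package names and URLs are invented for the example):

```ini
# .gitmodules in the top-level catalog repository --
# one submodule entry per package repository
[submodule "packages/7zip"]
    path = packages/7zip
    url = https://git.example.com/packages/7zip.git
[submodule "packages/firefox"]
    path = packages/firefox
    url = https://git.example.com/packages/firefox.git
```

The SVN counterpart would be an `svn:externals` property on the parent directory, with one `packages/7zip https://svn.example.com/packages/7zip/trunk`-style line per package.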
Take a look at Boar, "Simple version control and backup for photos, videos and other binary files". It can easily handle huge files and huge repositories.
Old question, but perhaps worth pointing out that Perforce is in use at lots of large companies, particularly games development companies, where multi-terabyte repositories with many large binary files are common.
(Disclaimer: I work at Perforce)
Yep, that is one of the cases Apache Subversion should fully support.
Up-to-date Apache Subversion servers and clients should have no problems controlling such an amount of data, and they scale perfectly. Moreover, there are various repository replication approaches that can improve performance in case you have multiple sites with developers working on the same projects.
svn:externals
have nothing to do with support for large binaries or multi-terabyte projects. Subversion scales perfectly and supports very large data and code bases in a single repository. Git, however, does not: with Git you have to divide and split a project into multiple small repositories. This leads to a lot of drawbacks and a constant PITA. That's why Git has a lot of add-ons such as git-lfs that try to make the problem less painful.
When you really have to use a VCS, I would use SVN, since SVN does not require copying the entire repository to the working copy. But it still needs about double the disk space, since it keeps a pristine copy of each file.
With this amount of data I would look for a document management system, or (lower-tech) use a read-only network share with a defined input process.
Update May 2017:
Git, with the addition of GVFS (Git Virtual File System), can support virtually any number of files of any size (starting with the Windows repository itself: "The largest Git repo on the planet", 3.5M files, 320GB).
This is not yet >1TB, but it can scale there.
The work done on GVFS is slowly being proposed upstream (that is, to Git itself), but it is still a work in progress.
GVFS is implemented on Windows, but it will soon be done for Mac (because the Windows team developing Office for Mac demands it), and then Linux.
April 2015
Git can actually be considered a viable VCS for large data with Git Large File Storage (LFS) (released by GitHub, April 2015).
git-lfs (see git-lfs.github.com) can be tested with a server that supports it, such as lfs-test-server (or directly with github.com itself).
You store only metadata in the git repo, and the large files elsewhere.
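As a minimal sketch of how that works: LFS replaces each tracked binary with a small text pointer (the format below follows the Git LFS pointer-file specification), and only the pointer is versioned. The parsing helper here is purely illustrative, not part of git-lfs:

```python
# Parse a Git LFS pointer file -- the small text stub that replaces
# the real binary inside the repository.
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of an LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 1073741824
"""

info = parse_lfs_pointer(pointer)
print(info["oid"])   # content address of the real file on the LFS server
print(info["size"])  # 1073741824 -> the ~1GB binary never enters the repo
```

So a >1GB MSI costs the repository only these three lines; clones stay fast and the payload lives on the LFS server.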
Version control systems are for source code, not binary builds. You are better off just using standard network file server backup tapes for binary file backup - even though it's largely unnecessary when you have source code control since you can just rebuild any version of any binary at any time. Trying to put binaries in source code control is a mistake.
What you are really talking about is a process known as configuration management. If you have thousands of unique software packages, your business should have a configuration manager (a person, not software ;-) ) who manages all of the configurations (a.k.a. builds) for development, testing, release, release-per-customer, etc.
You might be much better off simply relying on some NAS device that would provide a combination of filesystem-accessible snapshots together with single instance store / block level deduplication, given the scale of data you are describing ...
(The question also mentions .cab & .msi files: usually the CI software of your choice has some method of archiving builds. Is that what you are ultimately after?)
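To illustrate why block-level deduplication helps at this scale: repackaged installers often share most of their content between versions, so a single-instance store keeps each unique block only once. This is a toy sketch of the idea (fixed-size blocks, in-memory store), not any vendor's implementation:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks; real appliances often use variable-size chunking

def store_blocks(data: bytes, store: dict) -> list:
    """Split data into blocks, storing one copy per unique block.
    Returns the list of block hashes needed to reassemble the data."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # duplicate blocks are stored once
        recipe.append(digest)
    return recipe

store = {}
# Two 1MB "package versions" that differ only in their last 4KB block:
pkg_v1 = b"\x00" * 1_048_576
pkg_v2 = b"\x00" * 1_044_480 + b"\xff" * 4096
store_blocks(pkg_v1, store)
store_blocks(pkg_v2, store)
print(len(store))  # 2 -- 2MB of input, but only 2 unique blocks stored
```

With thousands of near-identical MSI/CAB revisions, that ratio is what keeps a multi-TB history affordable.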
This is an old question, but one possible answer is https://www.plasticscm.com/. Their VCS can handle very large files and very large repositories. They were my choice when we were evaluating options a couple of years ago, but management pushed us elsewhere.
There are a couple of companies with products for "Wide Area File Sharing." They can replicate large files to different locations, but have distributed locking mechanisms so only one person can work on any of the copies. When a person checks in an updated copy, that is replicated to the other sites. The major use is CAD/CAM files and other large files. See Peer Software (http://www.peersoftware.com/index.aspx) and GlobalSCAPE (http://www.globalscape.com/).
If you only care about the versioning metadata features and don't actually care about the old data then a solution that uses a VCS without storing the data in the VCS may be an acceptable option.
git-annex is the first one that comes to my mind, but judging from the "what git-annex is not" page there seem to be other similar, but not exactly the same, alternatives.
I have not used git-annex, but from the description and walkthrough it sounds like it could work for your situation.
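As a rough sketch of the idea (the key format below is only loosely modeled on git-annex keys, and `annex_file` is a hypothetical helper, not the real tool): the repository versions a tiny metadata stub, while the payload moves to a content-addressed store outside version control.

```python
import hashlib
import json
import os
import tempfile

def annex_file(path: str, store_dir: str) -> str:
    """Move a file into the content-addressed store;
    return the small metadata stub the VCS would track instead."""
    with open(path, "rb") as f:
        data = f.read()
    # Key combines size and content hash, loosely like a git-annex key:
    key = f"SHA256-s{len(data)}--{hashlib.sha256(data).hexdigest()}"
    os.makedirs(store_dir, exist_ok=True)
    with open(os.path.join(store_dir, key), "wb") as f:
        f.write(data)
    os.remove(path)  # only the stub remains to be checked in
    return json.dumps({"annex-key": key})

# Demo with a throwaway file standing in for a large installer:
workdir = tempfile.mkdtemp()
msi = os.path.join(workdir, "app.msi")
with open(msi, "wb") as f:
    f.write(b"fake installer payload")
stub = annex_file(msi, os.path.join(workdir, "annex-store"))
print(stub)  # this tiny JSON stub is what the VCS would version
```

The history then records which content (by hash and size) each revision of a package pointed at, without the repository ever holding the gigabyte-sized payloads itself.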