Distributed version control for HUGE projects - is it feasible?

Posted 2024-08-25 13:45:24

We're pretty happy with SVN right now, but Joel's tutorial intrigued me. So I was wondering - would it be feasible in our situation too?

The thing is - our SVN repository is HUGE. The software itself carries a 15-year legacy and has already survived several different source control systems. There are over 68,000 revisions (changesets), the source itself takes up over 100MB, and I can't even begin to guess how many GB the whole repository consumes.

The problem then is simple - a clone of the whole repository would probably take ages to make, and would consume far more space on the drive than is remotely sane. And since the very point of distributed version control is to have as many repositories as needed, I'm starting to have doubts.

How does Mercurial (or any other distributed version control) deal with this? Or are they unusable for such huge projects?

Added: To clarify - the whole thing is one monolithic beast of a project which compiles to a single .EXE and cannot be split up.

Added 2: Second thought - The Linux kernel repository uses git and is probably an order of magnitude or two bigger than mine. So how do they make it work?

拥抱没勇气 2024-09-01 13:45:24

Distributed version control for HUGE projects - is it feasible?

Absolutely! As you know, Linux is massive and uses Git. Mercurial is used for some major projects too, such as Python, Mozilla, OpenSolaris and Java.

We're pretty happy with SVN right now, but Joel's tutorial intrigued me. So I was wondering - would it be feasible in our situation too?

Yes. And if you're happy with Subversion now, you're probably not doing much branching and merging!

The thing is - our SVN repository is HUGE. [...] There are over 68,000 revisions (changesets), the source itself takes up over 100MB

As others have pointed out, that's actually not so big compared to many existing projects.

The problem then is simple - a clone of the whole repository would probably take ages to make, and would consume far more space on the drive than is remotely sane.

Both Git and Mercurial are very efficient at managing the storage, and their repositories take up far less space than the equivalent Subversion repo (having converted a few). And once you have an initial checkout, you're only pushing deltas around, which is very fast. They are both significantly faster in most operations. The initial clone is a one-time cost, so it doesn't really matter how long it takes (and I bet you'd be surprised!).

And since the very point of distributed version control is to have as many repositories as needed, I'm starting to have doubts.

Disk space is cheap. Developer productivity matters far more. So what if the repo takes up 1GB? If you can work smarter, it's worth it.

How does Mercurial (or any other distributed version control) deal with this? Or are they unusable for such huge projects?

It is probably worth reading up on how projects using Mercurial such as Mozilla managed the conversion process. Most of these have multiple repos, which each contain major components. Mercurial and Git both have support for nested repositories too. And there are tools to manage the conversion process - Mercurial has built-in support for importing from most other systems.
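
As a minimal illustration of the nested-repository support mentioned above, here is a sketch using Mercurial subrepos; the component path and URL are hypothetical:

    # Clone the component into place, then register it in .hgsub
    hg clone https://hg.example.com/core libs/core
    echo 'libs/core = https://hg.example.com/core' > .hgsub
    hg add .hgsub
    hg commit -m "Track libs/core as a subrepository"
    # Future clones of the parent pull libs/core at the recorded revision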

Added: To clarify - the whole thing is one monolithic beast of a project which compiles to a single .EXE and cannot be split up.

That makes it easier, as you only need the one repository.

Added 2: Second thought - The Linux kernel repository uses git and is probably an order of magnitude or two bigger than mine. So how do they make it work?

Git is designed for raw speed. The on-disk format, the wire protocol, the in-memory algorithms are all heavily optimized. And they have developed sophisticated workflows, where patches flow from individual developers, up to subsystem maintainers, up to lieutenants, and eventually up to Linus. One of the best things about DVCS is that they are so flexible they enable all sorts of workflows.

I suggest you read the excellent book on Mercurial by Bryan O'Sullivan, which will get you up to speed fast. Download Mercurial and work through the examples, and play with it in some scratch repos to get a feel for it.

Then fire up the convert command to import your existing source repository. Then try making some local changes, commits, branches, view logs, use the built-in web server, and so on. Then clone it to another box and push around some changes. Time the most common operations, and see how it compares. You can do a complete evaluation at no cost but some of your time.
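
A rough sketch of that evaluation loop, assuming a hypothetical SVN URL (convert is a bundled Mercurial extension that must first be enabled in your hgrc):

    # In ~/.hgrc:
    #   [extensions]
    #   convert =
    hg convert http://svn.example.com/bigrepo bigrepo-hg   # import the SVN history
    cd bigrepo-hg
    hg update                  # populate a working copy
    hg log --limit 5           # browse recent changesets
    hg serve                   # built-in web UI at http://localhost:8000
    # From another machine, clone over SSH and push changes back:
    hg clone ssh://devbox//path/to/bigrepo-hg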

宣告ˉ结束 2024-09-01 13:45:24

100MB of source code is less than the Linux kernel. The changelog between Linux kernel 2.6.33 and 2.6.34-rc1 alone has 6,604 commits. Your repository's scale doesn't sound intimidating to me.

  • Linux kernel 2.6.34-rc1 uncompressed from .tar.bz2 archive: 445MB
  • Linux kernel 2.6 head checked out from main Linus tree: 827MB

Twice as much, but still peanuts with the big hard drives we all have.

怀中猫帐中妖 2024-09-01 13:45:24

Don't worry about repository space requirements. My anecdote: when I converted our codebase from SVN to git (full history - I think), I found that the clone used less space than just the SVN working directory. SVN keeps a pristine copy of all your checked-out files: look at $PWD/.svn/text-base/ in any SVN checkout. With git, the entire history takes less space than that.

What really surprised me was how network-efficient git is. I did a git clone of a project at a well-connected place, then took it home on a flash disk, where I keep it up to date with git fetch / git pull, with just my puny little GPRS connection. I wouldn't dare to do the same in an SVN-controlled project.

You really owe it to yourself to at least try it. I think you'll be amazed at just how wrong your centralised-VCS-centric assumptions were.
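
If you want to reproduce that comparison, a quick sketch (the directory names are hypothetical; .svn/text-base is the pre-1.7 Subversion layout this answer refers to):

    # Total size of SVN's pristine copies, spread across every .svn directory
    find project-svn -type d -path '*/.svn/text-base' -print0 | xargs -0 du -ch | tail -n 1
    # git's complete packed history lives in a single .git directory
    du -sh project-git/.git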

听,心雨的声音 2024-09-01 13:45:24

Do you need all the history? If you only need the last year or two, you could consider leaving the current repository in a read-only state for historical reference. Then create a new repository with only recent history by running svnadmin dump with a lower-bound revision; that dump forms the basis for your new distributed repository.

I do agree with the other answer that a 100MB working copy and 68K revisions isn't that big. Give it a shot.
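
A sketch of that approach; the cutoff revision (60000) and paths are made up:

    # Dump only recent history; the first dumped revision is written as a
    # complete snapshot, so the new repository is self-contained
    svnadmin dump /srv/svn/bigrepo -r 60000:HEAD > recent.dump
    svnadmin create /srv/svn/bigrepo-recent
    svnadmin load /srv/svn/bigrepo-recent < recent.dump
    # Keep /srv/svn/bigrepo around read-only for historical reference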

坐在坟头思考人生 2024-09-01 13:45:24

You say you're happy with SVN... so why change?

As far as distributed version control systems go, Linux uses git and Sun uses Mercurial. Both are impressively large source code repositories, and they work just fine. Yes, you end up with all revisions on all workstations, but that's the price you pay for decentralisation. Remember that storage is cheap - my development laptop currently has 1TB (2x500GB) of hard disk storage on board. Have you tested pulling your SVN repo into something like Git or Mercurial to actually see how much space it would take?

My question would be - are you ready as an organisation to go decentralised? For a software shop it usually makes much more sense to keep a central repository (regular backups, hook-ups to CruiseControl or FishEye, easier to control and administer).

And if you just want something faster or more scalable than SVN, then just buy a commercial product - I've used both Perforce and Rational ClearCase and they scale up to huge projects without any problems.
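
One low-cost way to run that test is a one-shot import with git-svn (the URL is hypothetical; git svn ships with most git distributions):

    git svn clone --stdlayout http://svn.example.com/bigrepo bigrepo-git
    cd bigrepo-git
    git gc --aggressive     # repack tightly for a fair size comparison
    du -sh .git             # the entire history, compressed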

絕版丫頭 2024-09-01 13:45:24

You'd split your one huge repository into lots of smaller repositories, one for each module in your old repo. That way people would simply keep as repositories whatever SVN projects they would have had before. Not much more space is required than before.
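
If you went that route, Mercurial's convert extension can carve a module out of the big repository with a filemap; the module name here is hypothetical:

    # filemap.txt keeps only moduleA and promotes it to the new repo's root
    cat > filemap.txt <<'EOF'
    include moduleA
    rename moduleA .
    EOF
    hg convert --filemap filemap.txt big-repo moduleA-repo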

蓝眼泪 2024-09-01 13:45:24

I am using git on a fairly large C#/.NET project (68 projects in 1 solution) and the TFS footprint of a fresh fetch of the full tree is ~500MB. The git repo, storing a fair number of commits locally, weighs in at ~800MB. The compaction and the way storage works internally in git are excellent. It is amazing to see so many changes packed into such a small amount of space.
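
For anyone curious how git reports those packed sizes, two standard commands:

    git gc                   # repack loose objects into compressed packfiles
    git count-objects -vH    # human-readable totals; see the size-pack line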

本王不退位尔等都是臣 2024-09-01 13:45:24

From my experience, Mercurial is pretty good at handling a large number of files and a huge history. The drawback is that you shouldn't check in files bigger than 10MB. We used Mercurial to keep a history of our compiled DLLs. It's not recommended to put binaries in source control, but we tried it anyway (it was a repository dedicated to the binaries). The repository was about 2GB, and we are not too sure that we will be able to keep doing that in the future. Anyway, for source code I don't think you need to worry.

2024-09-01 13:45:24

Git can obviously work with a project as big as yours since, as you pointed out, the Linux kernel alone is bigger.

The challenge with Mercurial and Git (I don't know if you manage big files) is that they can't manage big files (so far).

I have experience moving a project your size (and one that had been around for 15 years too) from CVS/SVN (a mix of the two, actually) into Plastic SCM for distributed and centralized development (the two workflows happening inside the same organization at the same time).

The move will never be seamless, since it's not only a tech problem but involves a lot of people (a project as big as yours probably involves several hundred developers, doesn't it?), but there are importers to automate the migration, and training can be done very fast too.

近箐 2024-09-01 13:45:24

No, it does not work. You don't want anything that requires significant storage on the client side. If you get that large (by storing, for example, images etc.), the storage requirement exceeds what a normal workstation has anyway, so it cannot be efficient.

You'd better go with something centralized then. Simple math - it simply is not feasible to have tons of GB on every workstation AND be efficient there. It simply makes no sense.
