How to identify and possibly remove large binary commits inside an SVN repository?
I am working with an SVN repository that is over 3 years old, contains over 6,100 commits and is over 1.5 GB in size. I want to reduce the size of the SVN repository (I'm not talking about the size of a full SVN export - I mean the full repository as it would exist on the server) before moving it to a new server.
The current repository contains the source code for all of our software projects but it also contains relatively large binary files of no significance such as:
- Full installers for a number of 3rd party tools.
- .jpg & .png files (which are unmodified exports of PSDs that live in the same folder).
- Bin and Obj folders (which were then 'svn ignored' in the next commit).
- Resharper directories.
A number of these large files have been 'SVN deleted' since they were added, which makes identifying the biggest offenders even harder.
I want to either:
- Create a new SVN repository that contains only the code for all of the software projects - it is really important that the copied files maintain their SVN history from the old repository.
- Remove the large binary commits and files from the existing repository.
Is either of these possible?
Comments (7)
Otherside is right about svnadmin dump, etc. Something like this will get you a rough pointer to revisions that added lots of data to your repo, and are candidates for svndumpfilter:
You could also try something like this to find revisions that added files with a particular extension (here, .jpg):
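Both ideas can be approximated by reading the dump stream directly. What follows is a minimal Python sketch, not the commands this answer originally used; the names `read_records` and `summarize` are hypothetical. It assumes the standard dump layout (a block of `Key: value` headers, a blank line, then `Content-length` bytes of body) and simply adds up node body sizes per revision. With `svnadmin dump --deltas` the sizes are those of the deltas, so treat the numbers as a rough pointer only.

```python
from collections import defaultdict


def _skip(stream, count, chunk=1 << 20):
    """Discard `count` bytes without holding a large body in memory."""
    while count > 0:
        data = stream.read(min(count, chunk))
        if not data:
            break
        count -= len(data)


def read_records(stream):
    """Yield the header dict of each dump record, skipping record bodies."""
    while True:
        line = stream.readline()
        if not line:
            return
        if line == b"\n":
            continue  # blank separator between records
        headers = {}
        while line not in (b"", b"\n"):
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key.decode()] = value.decode("utf-8", "replace")
            line = stream.readline()
        _skip(stream, int(headers.get("Content-length", 0)))
        yield headers


def summarize(stream, suffix=".jpg"):
    """Return (bytes of node content per revision, added paths per revision).

    `suffix` is the extension to look for among added paths.
    """
    sizes = defaultdict(int)
    matches = defaultdict(list)
    rev = None
    for headers in read_records(stream):
        if "Revision-number" in headers:
            rev = int(headers["Revision-number"])
            sizes[rev] += 0  # make sure empty revisions are listed too
        elif "Node-path" in headers:
            sizes[rev] += int(headers.get("Content-length", 0))
            path = headers["Node-path"]
            if headers.get("Node-action") == "add" and path.lower().endswith(suffix):
                matches[rev].append(path)
    return dict(sizes), dict(matches)
```

One way to use it, assuming the script reads standard input, is to pipe `svnadmin dump --quiet /path/to/repo` into a small wrapper that calls `summarize(sys.stdin.buffer)` and prints the revisions sorted by size.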
You will have to use svnadmin dump to get a dump file of your current repository, and possibly svndumpfilter to process the dump file. You can also modify the dump file manually, as long as you're careful.
It's probably not going to be a quick and easy job, but it can be done. I've done something similar, only to a much smaller repository. I had a repo with about 150 revisions that took about 600MB.
Make a dump from your current repository, make the necessary changes and try to load the modified dumpfile in a new repository. Then check the new repository to make sure everything is still making sense (History is still correct, no weird changes in paths, ...).
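The dump, filter and load steps can be chained so that no intermediate dump file is written. This is a minimal Python sketch under stated assumptions: the helper names `build_commands` and `run_pipeline` and the example paths are hypothetical, and the new repository is created with `svnadmin create` before loading. Note that `svndumpfilter` stops with an error if a path you keep was copied from a path you exclude, and that `--renumber-revs` changes revision numbers, so references such as "r1234" in old log messages will no longer line up.

```python
import subprocess


def build_commands(old_repo, new_repo, excluded_paths):
    """Return the argv lists for the dump, filter and load steps."""
    dump = ["svnadmin", "dump", "--quiet", old_repo]
    filt = ["svndumpfilter", "exclude", "--drop-empty-revs",
            "--renumber-revs", *excluded_paths]
    load = ["svnadmin", "load", "--quiet", new_repo]
    return dump, filt, load


def run_pipeline(old_repo, new_repo, excluded_paths):
    """Create new_repo, then stream old_repo into it through the filter."""
    dump, filt, load = build_commands(old_repo, new_repo, excluded_paths)
    subprocess.run(["svnadmin", "create", new_repo], check=True)
    p_dump = subprocess.Popen(dump, stdout=subprocess.PIPE)
    p_filt = subprocess.Popen(filt, stdin=p_dump.stdout, stdout=subprocess.PIPE)
    p_dump.stdout.close()  # let svnadmin dump see a closed pipe if the filter exits
    p_load = subprocess.Popen(load, stdin=p_filt.stdout)
    p_filt.stdout.close()
    codes = [p.wait() for p in (p_load, p_filt, p_dump)]
    if any(codes):
        raise RuntimeError(f"pipeline failed, exit codes (load, filter, dump): {codes}")
```

A call might look like `run_pipeline("/srv/svn/old", "/srv/svn/new", ["trunk/Installers", "trunk/ProjectA/bin"])`, where every path is a placeholder. Run it against a copy first, then check the history of the new repository as described above.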
If you deleted files from the repository using "SVN Delete", you didn't actually delete the files. This is the beauty of SVN: once a file is added to the repository, it is there forever (unless you use dump & load). When you "delete" a file, you actually create a new revision that marks the deletion, but the file continues to exist in previous revisions.
I've done some dump & load work, but on a much, much bigger repository: around 60,000 (!!!) revisions. It took time, but in the end, after careful loading, the repository was rebuilt.
Your only way is to list the revisions in which the files were added, modified and deleted, then dump the revisions in between and load them in the right order. BE AWARE, there is no room for mistakes. If you make a mistake, you will have to start over and dump & load from the start.
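Working out which revision ranges to dump is easy to get wrong by hand. Here is a minimal Python sketch, with hypothetical names `ranges_to_keep` and `dump_commands`, that turns a list of unwanted revisions into the contiguous ranges in between and the matching `svnadmin dump` invocations. It assumes each skipped revision touches only unwanted files and that nothing kept later depends on it; if that does not hold, loading the later ranges will fail.

```python
def ranges_to_keep(head, skip):
    """Return inclusive (first, last) revision ranges from 1..head, omitting `skip`."""
    ranges, start = [], None
    for rev in range(1, head + 1):
        if rev in skip:
            if start is not None:
                ranges.append((start, rev - 1))
                start = None
        elif start is None:
            start = rev
    if start is not None:
        ranges.append((start, head))
    return ranges


def dump_commands(repo, ranges):
    """Build one incremental `svnadmin dump` argv list per range, in load order."""
    return [["svnadmin", "dump", "--incremental", "-r", f"{first}:{last}", repo]
            for first, last in ranges]
```

For example, `ranges_to_keep(10, {3, 4, 8})` gives `[(1, 2), (5, 7), (9, 10)]`. Each resulting dump would then be loaded into the new repository in that order; the loaded revisions get new, consecutive numbers.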
My suggestion: if the large files are such a problem, consider creating a fresh repository with no history. Keep the old one for history comparison, and start working from the fresh one.
Good Luck.
If you just need to find the offending commits and you have access to the server hosting the repository: look for large files in the db/revs subdirectory of the repository (assuming it uses the FSFS format).
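In an FSFS repository each file under db/revs is named after its revision number, grouped into shard directories, so the largest files point straight at the heaviest commits. A minimal Python sketch of that search follows; the function name `largest_rev_files` is hypothetical. If the repository has been packed, older revisions live together in `N.pack/pack` files and cannot be told apart this way.

```python
import os


def largest_rev_files(repo_path, top=10):
    """Return the `top` largest files under db/revs as (size_in_bytes, relative_path)."""
    revs_dir = os.path.join(repo_path, "db", "revs")
    found = []
    for dirpath, _dirnames, filenames in os.walk(revs_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append((os.path.getsize(full), os.path.relpath(full, revs_dir)))
    return sorted(found, reverse=True)[:top]
```

The last component of each returned path is the revision number, which you can then inspect with `svn log -v -r N`.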
Elaborating on Otherside's answer, here's what specifically worked for me:
You might be able to exclude your Obj and Bin directories by adding them to the svndumpfilter command (I didn't try it).
Also, Subversion's fsfs-stats program (new in Subversion 1.8, replaced in 1.9 by svnfsfs stats) might be useful for quantifying the file types and specific files that are filling up your repository.
This might be useful for comparing the repositories afterward:
Isn't this just a different problem, with an extra step? I.e. you need to locate files that you consider to be large and binary, and then check if they are indeed managed by SVN or have been built locally (or imported from the parallel asset system, if it's already in place).
So, just find the files, then do svn info on them to find out if they're part of the repository.
Just a small thought: you say that the current state of the repository (the current HEAD) is good, i.e. the large binary files have been svn deleted in the past. So your issue is purely the size of the repository?
I know you said you would like to keep all the commit history, but as an option, you could do two dumps: one of the whole revision history, and one of the current HEAD revision only.
If you put the full dump on a DVD, for example, you would have the data available if you ever needed it, and you could then delete the whole repository and use svnadmin load to load the HEAD-only dump, leaving you with a small, clean repository.
It is also possible to dump from a specific revision onwards, rather than just the HEAD, so for example you could keep the last 3 months of revisions and put everything older on a DVD.