How to cleanly handle source code and data in a repository
I'm working on a collaborative scientific project that is made up of a handful of Python scripts (1M max) and a relatively large dataset (1.5 GB). The datasets are tightly linked to the Python scripts, since the datasets themselves are the science and the scripts are a simple interface to them.
I'm using Mercurial as my source control tool, but I am not clear on a good mechanism for defining the repository. Logistically it makes sense to bundle these together, so that by cloning the repository you'd get the entire package. On the other hand, I'm concerned about the source control tool dealing with large amounts of data.
Is there a clean mechanism to handle this?
3 Answers
If the data files change rarely and you normally need all of them anyway, then just add them to Mercurial and be done with it. All your clones will be 1.5 GB, but that is just the way it has to be with that amount of data.
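For that first option the workflow is just the ordinary add-and-commit cycle; a minimal sketch, assuming the dataset sits in a `data/` directory at the repository root and using a made-up clone URL:

```
# Track the dataset alongside the scripts in one repository.
hg add data/
hg commit -m "Add dataset alongside the analysis scripts"

# Collaborators then get code and data in a single clone.
hg clone https://hg.example.org/project
```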
If the data is binary and changes often, then you might try to avoid downloading all of the old data. One way to do this is to use a Subversion subrepository. You will have a `.hgsub` file which tells Mercurial to make a `svn checkout` from the right-hand side URL and put the Subversion working copy into your Mercurial clone as `data`. Mercurial will maintain an additional file for you called `.hgsubstate`, in which it records the SVN revision number to check out for any given Mercurial changeset. By using Subversion like this, you only end up with the latest version of the data on your machine, but Mercurial will know how to get older versions of the data when needed. Please see this guide to subrepositories if you go down this route.
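To make the subrepository mechanics concrete, here is a minimal sketch of what the `.hgsub` file could contain; the `data` path and the Subversion URL are placeholders, not values from the original answer:

```
# .hgsub -- maps a path inside the Mercurial clone to an external repository.
# The [svn] prefix tells Mercurial this subrepository is a Subversion checkout;
# the URL on the right-hand side is a hypothetical example.
data = [svn]https://svn.example.org/project-data/trunk
```

Once `.hgsub` is committed, Mercurial checks the subrepository out on clone and update, and pins the SVN revision in `.hgsubstate` for every changeset, so updating to an old changeset brings back the matching version of the data.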
There is an article on the official wiki about handling large binary files. But the suggestion from @MartinGeisler is a really nice new alternative.
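One of the approaches discussed in that area of the wiki is the largefiles extension (bundled with Mercurial since version 2.0); a hedged sketch of using it, with an illustrative file name:

```
# Enable the extension in your hgrc:
#   [extensions]
#   largefiles =

# Files added with --large are kept outside the normal history; clones only
# download the large-file revisions they actually check out.
hg add --large data/measurements.h5
hg commit -m "Track the 1.5 GB dataset as a largefile"
```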
My first inclination is to separate the Python scripts out into their own repository, but I really need more domain information to make the "right" call.
On the one hand, if new datasets will be created, then you would want a core set of tools able to handle all of them, right? But I can also see how new datasets might introduce cases that the scripts have not previously handled... although it seems like, in an ideal world, you would want scripts written in a general way so they can handle both future data and the existing datasets?