How to combine version control with data analysis
I do a lot of solo data analysis, using a combination of tools such as R, Python, PostgreSQL, and whatever I need to get the job done. I use version control software (currently Subversion, though I'm playing around with git on the side) to manage all of my scripts, but the data is perpetually a challenge. My scripts tend to run for a long period of time (hours, or occasionally days) to generate small or large datasets, which I in turn use as input for more scripts.
The challenge I face is how to "roll back" what I've done when I want to check out my scripts from an earlier point in time. Getting the old scripts is easy. Getting the old data would be easy too if I put my data into version control, but conventional wisdom seems to be to keep data out of version control because it's so darned big and cumbersome.
My question: how do you combine and/or manage your processed data with a version control system on your code?
Comments (1)
Subversion, and maybe other [D]VCSs as well, supports symbolic links. The idea is to store the raw data, well organized, on a filesystem, while tracking the relation between each script and its generated data with symbolic links under version control.
All your scripts will call
load data
to retrieve a given dataset, which is resolved through a versioned symbolic link pointing at the actual data files. With this approach, code and computed datasets are tracked within one tool, without bloating your repository with binary data.
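As a rough illustration of the symlink idea, here is a minimal Python sketch. It is not taken from the answer: the function names `publish_dataset` and `load_data`, the directory layout, and the `.dat` naming are all hypothetical, and it assumes a POSIX filesystem where creating symlinks needs no special privileges. The bulky files live in a data store outside the repository, and only the small symlink inside the working copy would be committed.

```python
import tempfile
from pathlib import Path


def publish_dataset(data_store: Path, links_dir: Path, name: str,
                    run_id: str, payload: bytes) -> Path:
    """Store one generated dataset outside the repo and point a symlink at it.

    data_store: directory outside version control holding the real files.
    links_dir:  directory inside the working copy holding the symlinks.
    (Both are illustrative; any layout would do.)
    """
    target = data_store / name / f"{run_id}.dat"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)

    links_dir.mkdir(parents=True, exist_ok=True)
    link = links_dir / name
    if link.is_symlink():
        link.unlink()            # drop the pointer to the previous run
    link.symlink_to(target)      # this small link is what gets committed
    return link


def load_data(links_dir: Path, name: str) -> bytes:
    """Read whichever dataset the versioned symlink currently points to."""
    link = links_dir / name
    return link.resolve(strict=True).read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store = root / "data_store"          # outside the repository
        links = root / "repo" / "data"       # inside the working copy

        publish_dataset(store, links, "features", "run1", b"first result")
        print(load_data(links, "features"))

        publish_dataset(store, links, "features", "run2", b"second result")
        print(load_data(links, "features"))
```

Checking out an older revision would restore the older symlink, so `load_data` would read the dataset that matched the scripts at that revision, provided the old files are still kept in the data store.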