How to combine "revision control" with "workflow" for R?
I remember coming across R users writing that they use "revision control" (also called "source control"), and I am curious to know: how do you combine revision control with your statistical analysis workflow?
Two (very) interesting discussions talk about how to deal with the workflow. But neither of them refers to the revision control element:
A long update to the question: following some of the answers, and Dirk's question in the comments, I would like to focus my question a bit more.
After reading the Wiki article about "revision control" (which I was previously not familiar with), it became clear to me that when using revision control, what one does is build a development structure for one's code. This structure leads either to a "final product" or to several branches.
When building something like, let's say, a website, there is usually one end product you work towards (the website), with some prototypes along the way.
But when doing a statistical analysis, the work is (in my view) different. Sometimes you know where you want to get to. But more often, you explore: you explore ways of cleaning the dataset, explore different methods of statistical analysis, and ask various questions of your data (and I am writing this knowing how Frank Harrell and other experienced statisticians feel about data dredging).
That is why the workflow question in statistical programming is (in my view) a serious and deep question, raising many issues. The simpler ones are technical:
- Which revision control software do you use (and why)?
- Which IDE do you use (and why)?
The more interesting questions are about the work process:
- How do you structure your files?
- What do you keep as a separate file and what as a revision? Or, asking in a different way: what should be a "branch" and what should be a "sub-project" in your code? For example, when starting to explore your data, should a plot be created and then erased because it didn't lead anywhere (but kept as a revision), or should there be a backup file of that path?
How you solve this tension was my initial curiosity. The second question is "what might I be missing?". What rules (of thumb) should one follow to avoid common pitfalls when doing statistical programming with version control?
My intuition is that statistical programming is inherently different from software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's why I am unsure which of the lessons I have read here about version control would be applicable.
Thanks a lot,
Tal
My workflow is not that different from Bernd's. I usually have a main directory where I put all my *.R code files. As soon as I have more than about 5 lines in a text file I start version control, in my case git. Most of my work is not in a team context, meaning that I'm the only one changing my code. As soon as I make a substantive change (yes, that is subjective) I do a check-in. I agree with Dirk that this process is orthogonal to the workflow.
I use Eclipse + StatET, and while there is a plugin for git in Eclipse (EGit and probably others), I don't use it. I'm on Windows and just use git-gui for Windows. Here are some more options.
There's a lot of room for personal idiosyncrasies in version control, but I recommend this one tip as a best practice: if you report results to others (e.g. a journal article, your team, management in your firm), ALWAYS do a version control check-in right before running results that go out to others. Invariably, 3 months later someone will look at your results and ask a question about the code that you can't answer unless you know the EXACT state of the code when you produced those results. So make it a practice, and put in the check-in comment "this is the version of the code that I used for 4th quarter financials" or whatever your use case is.
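A minimal sketch of this tip using git, assuming it is run from inside the project's repository. The function name `snapshot_before_report` and the tag name in the usage comment are hypothetical; the git commands themselves are standard.

```shell
# Hypothetical helper: commit the current state of the analysis and tag it,
# so the exact code behind a set of reported results can be found later.
# Usage: snapshot_before_report <tag-name> <message>
#   e.g. snapshot_before_report q4-financials "Code used for 4th quarter financials"
snapshot_before_report() {
  tag_name="$1"
  message="$2"
  # Stage every change, including new files (ignored files are skipped).
  git add -A || return 1
  # Record a commit even if nothing changed since the last one.
  git commit --allow-empty -m "$message" || return 1
  # An annotated tag marks this exact commit under a memorable name.
  git tag -a "$tag_name" -m "$message"
}
```

Three months later, `git checkout q4-financials` checks out the code exactly as it was when the results were produced.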
Also keep in mind that version control is no replacement for a good backup plan. My motto is: "3 copies. 2 geographies. 1 mind at peace."
EDIT (Feb 24, 2010): Joel Spolsky, one of the founders of Stack Overflow, just released a highly visual and very cool intro to Mercurial. This tutorial alone may be reason to adopt Mercurial if you have not already chosen a revision control system. I think when it comes to Git vs. Mercurial the most important advice is to choose one and use it. Maybe use what your friends/coworkers use, or use the one with the best tutorial. But just use one already! ;)
Rather than focusing on revision control in particular, it sounds like you're really asking a bigger question about how statistical analysis compares to software development. That's an interesting question. Here are some thoughts:
Data analysis can be more like an art than a science. In a sense, you might want to look for inspiration to the process that an author would follow when writing a book more than the process that a software developer would follow. On the other hand, I have yet to encounter a software project that followed a straight line. And even at a theoretical level, there is a great amount of variance in software development methodologies. Of these, given that a statistical analysis can be a discovery process (i.e. one that can't be fully planned up front), it would make sense to follow something like an agile methodology (much more so than something like the waterfall methodology). In other words, you need to plan for your analysis to be iterative and self-reflective.
That said, I think the notion that statistical analysis is purely exploratory with no goal in mind is potentially problematic. That can lead to the point where you are 5 steps past your eureka moment, and have no way to get back to it. There is always a goal of some sort, even if the goal itself is changing. Moreover, if there is no goal, how will you know when you've reached the end?
One approach is to start off with one R file as you start a project (or a set of files, as in the Josh and Bernd examples), and progressively add to it (so that it grows in size) as you make discoveries. This is especially true when you have data that needs to be kept as part of the analysis. This file should be version controlled regularly to ensure that you can always step backwards if you make mistakes (allowing for incremental gains). Version control systems are immensely helpful in development not just because they ensure that you don't lose things, but also because they provide you with a timeline. Tag your check-ins so that you know what's in them at a glance, and note major milestones. I love JD's point about checking in before submitting something.
Once you have reached your final set of conclusions, it's often best to create a final version of your file that summarizes your analysis from start to end. You might even consider putting this into a Sweave document so that it's fully self-contained and literate.
You should also give serious thought to what others around you are doing. Nothing makes me cringe more than to see people reinventing the wheel, especially when it means extra work for the group as a whole to integrate with.
Your decisions about which version control system to use, which IDE, etc. (implementation issues) are ultimately extremely low on the totem pole in relation to the overall project management. Just use any one of them properly and you're already 95% of the way there, and the differences between them are small in comparison to the alternative of using nothing.
Lastly, if you are using something like GitHub, Google Code, or R-Forge, you will note something that they all have in common: a suite of tools beyond just a version control system. Namely, you should consider using things like the issue tracking system and the wiki to document progress and log open issues/tasks. The more organized you are with your analysis, the greater the likelihood of success.
I'm using git for version control. My typical directory structure (e.g. for articles) is as follows.
Most directories/files (ana, doc, org) are under version control. Of course, large binary datasets are excluded from version control (via .gitignore). README is an Emacs org-mode file.
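As a rough illustration of keeping code under version control while excluding data, here is a shell sketch. It is not the layout from this answer: only the `ana`, `doc` and `org` directory names come from the text, while the `data` directory, the ignore patterns and the function name `init_analysis_project` are assumptions.

```shell
# Hypothetical helper: create a project skeleton and tell git to ignore
# large binary data. Only ana, doc and org come from the text above; the
# data directory and the ignored patterns are assumptions.
# Usage: init_analysis_project <project-dir>
init_analysis_project() {
  project_dir="$1"
  mkdir -p "$project_dir/ana" "$project_dir/doc" \
           "$project_dir/org" "$project_dir/data" || return 1
  # Paths matching .gitignore patterns are skipped by "git add" and "git status".
  cat > "$project_dir/.gitignore" <<'EOF'
# large binary datasets stay out of version control
data/
*.RData
*.rds
EOF
  # Turn the directory into a git repository.
  git init -q "$project_dir"
}
```

After `init_analysis_project my-article`, anything placed under `my-article/data/` is invisible to git, while scripts in `ana/` are tracked as usual once added.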
After reading your update, it seems like you are viewing the choice and use of a version control system as dictating the structure of your repository and workflow. In my opinion, version control is more akin to an insurance policy, as it provides the following services:
Backups. If something gets accidentally deleted or the whims of fate fry your hard drive, your work can be recovered from the repository. With distributed version control nothing short of the apocalypse can cause you to lose work-- in which case you'll probably have other things to worry about anyway.
The mother of all undo buttons. Was the analysis looking better an hour ago? a day ago? a week ago? Version control provides a rewind button that allows you to travel back in time.
If you are the only person working on a project, the above two points probably outline how version control systems will affect the way you work.
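To make the "rewind button" concrete, here is a small sketch using git (one of several systems that would do). The function name `rewind_file` and the file name in the comments are hypothetical; `git log` and `git checkout <commit> -- <path>` are standard git commands.

```shell
# Hypothetical helper: restore one file to the way it looked in an earlier commit.
# Usage: rewind_file <commit> <path>
rewind_file() {
  commit="$1"
  path="$2"
  # Overwrites the working copy (and index entry) of <path> with the old
  # version; every other file is left alone.
  git checkout "$commit" -- "$path"
}

# Typical use, from inside a repository:
#   git log --oneline -- analysis.R     # list the commits that touched the file
#   rewind_file HEAD~1 analysis.R       # take that file back one commit
```

Because the newer version is still stored in the history, rewinding is itself reversible.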
The other side of version control systems is that they foster collaborative efforts by allowing people to experiment on an isolated copy or "branch" of the project material and then "merge" any positive changes back into the master copy. They also provide a means for project members to keep tabs on whose changes affected which lines of which files.
As an example, I keep all of my college coursework under version control in a Subversion repository. I am the only one who works on this repository so I never branch or merge the source-- I just commit and occasionally rewind. The ability to rewind my work reduces the risks of trying some sort of new analysis-- I just do it. If two hours later it looks like it wasn't such a good idea, I just revert the project files and try something different.
In contrast, almost all of my non-coursework package/program development is hosted under git. In this sort of setting I frequently want to experiment on a branch while having a stable master copy available. I use git rather than Subversion in these situations because git makes branching and merging an effortless task.
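A sketch of that branch-and-merge pattern in git, assuming the stable branch's name is passed in explicitly (it may be `master`, `main`, or something else). The three function names are hypothetical; the underlying git commands are standard.

```shell
# Hypothetical helpers for trying an idea on a branch while the stable
# branch stays untouched.

# Usage: start_experiment <experiment-branch>
start_experiment() {
  # Create the branch from the current commit and switch to it.
  git checkout -b "$1"
}

# Usage: keep_experiment <stable-branch> <experiment-branch>
keep_experiment() {
  # Switch back to the stable branch and bring the experiment's commits in.
  git checkout "$1" && git merge --no-edit "$2"
}

# Usage: drop_experiment <stable-branch> <experiment-branch>
drop_experiment() {
  # Switch back and delete the branch; -D allows deleting unmerged work.
  git checkout "$1" && git branch -D "$2"
}
```

For example: `start_experiment try-glm`, commit a few attempts, then either `keep_experiment master try-glm` or `drop_experiment master try-glm`.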
The important point is that in both of these cases the structure of my repository and the workflow I use are not decided by my version control system-- they are decided by me. The only impact the version control has on my workflow is that it frees me from worrying about trying something new, deciding I don't like it, and then having to undo all the changes to get back to where I started. Because I use version control, I can follow Yogi Berra's advice:
"When you come to a fork in the road, take it."
Because I can always go back and take it the other way.
I use git, myself. Local repositories, stored in the same directory as the R project. That way, if I eliminate a project down the road, the repository goes with it; I can work offline; and I don't have IRB, FERPA, or HIPAA issues to deal with.
If I need added backup assurance, I can push to a remote (secured!) repository.
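A minimal sketch of that backup step, assuming a remote repository already exists at a URL or path you control. The function name `backup_to_remote` and the example remote name and URL are hypothetical.

```shell
# Hypothetical helper: register a remote once, then push all branches and tags.
# Usage: backup_to_remote <remote-name> <remote-url>
#   e.g. backup_to_remote backup ssh://user@example.org/srv/git/project.git
backup_to_remote() {
  remote_name="$1"
  remote_url="$2"
  # Add the remote only if it is not configured yet.
  git remote get-url "$remote_name" >/dev/null 2>&1 ||
    git remote add "$remote_name" "$remote_url" || return 1
  # Send every local branch, then every tag.
  git push "$remote_name" --all && git push "$remote_name" --tags
}
```

Whether the remote is secure enough for regulated data depends entirely on where it is hosted, not on git.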
-Wil