Writing functions vs. line-by-line interpretation in an R workflow
Much has been written here about developing a workflow in R for statistical projects. The most popular workflow seems to be Josh Reich's LCFD model, with a main.R containing:
source('load.R')
source('clean.R')
source('func.R')
source('do.R')
so that a single source('main.R') runs the entire project.
Q: Is there a reason to prefer this workflow to one in which the line-by-line interpretive work done in load.R, clean.R, and do.R is replaced by functions which are called by main.R?
I can't find the link now, but I had read somewhere on SO that when programming in R one must get over the desire to write everything in terms of function calls, that R was MEANT to be written in this line-by-line interpretive form.
Q: Really? Why?
I've been frustrated with the LCFD approach and am probably going to write everything in terms of function calls. But before doing this, I'd like to hear from the good folks of SO as to whether this is a good idea or not.
EDIT: The project I'm working on right now is to (1) read in a set of financial data, (2) clean it (quite involved), (3) estimate some quantity associated with the data using my estimator, (4) estimate that same quantity using traditional estimators, and (5) report the results. My programs should be written in such a way that it's a cinch to do the work (1) for different empirical data sets, (2) for simulation data, or (3) using different estimators. ALSO, it should follow literate programming and reproducible research guidelines so that it's simple for a newcomer to the code to run the program, understand what's going on, and see how to tweak it.
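One way to get that kind of flexibility, sketched with hypothetical names: pass the data and the estimator in as arguments, so swapping data sets or estimators touches only the call site.

```r
# Estimators are just functions with a common signature.
my_estimator   <- function(x) mean(x, trim = 0.1)  # stand-in for "my estimator"
trad_estimator <- function(x) mean(x)              # stand-in for a traditional one

run_study <- function(data, estimator) {
  list(n = length(data), estimate = estimator(data))
}

# The same driver serves empirical data, simulated data, or another estimator:
sim <- rnorm(100)
run_study(sim, my_estimator)
run_study(sim, trad_estimator)
```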
6 Answers
I think that any temporary stuff created in source'd files won't get cleaned up. If I do:
and source that as a file, x hangs around although I don't need it. But if I do:
and instead of source, do z=ff(big) in my script, the x matrix goes out of scope and so gets cleaned up.
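A minimal sketch of the two versions being compared (illustrative; `big` is assumed to be a matrix dimension):

```r
# Scripted version: source() runs this in the global environment,
# so the temporary matrix x lingers after z has been computed.
big <- 100
x <- matrix(runif(big^2), big, big)
z <- sum(x)

# Function version: x is local to ff() and goes out of scope
# (and can be garbage-collected) once the call returns.
ff <- function(big) {
  x <- matrix(runif(big^2), big, big)
  sum(x)
}
z <- ff(big)
```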
Functions enable neat little re-usable encapsulations and don't pollute outside themselves. In general, they don't have side effects. Your line-by-line scripts could be using global variables and names tied to the data set in current use, which makes them hard to reuse.
I sometimes work line-by-line, but as soon as I get more than about five lines I see that what I have really needs making into a proper reusable function, and more often than not I do end up re-using it.
I don't think there is a single answer. The best thing to do is keep the relative merits in mind and then pick an approach for that situation.
1) Functions. The advantage of not using functions is that all your variables are left in the workspace and you can examine them at the end. That may help you figure out what is going on if you have problems.

On the other hand, the advantage of well-designed functions is that you can unit test them. That is, you can test them apart from the rest of the code, making them easier to test. Also, when you use a function, modulo certain lower-level constructs, you know that the results of one function won't affect the others unless they are passed out, and this may limit the damage that one function's erroneous processing can do to another's. You can use the debug facility in R to debug your functions, and being able to single-step through them is an advantage.

2) LCFD. Whether you should use a decomposition of load/clean/func/do at all, regardless of whether it's done via source or via functions, is a second question. The problem with this decomposition, either way, is that you need to run one step just to be able to test out the next, so you can't really test them independently. From that viewpoint it's not the ideal structure. On the other hand, it does have the advantage that you may be able to replace the load step independently of the other steps if you want to try the code on different data, and you can replace the later steps independently of the load and clean steps if you want to try different processing.
3) Number of files. There may be a third question implicit in what you are asking: whether everything should be in one source file or in several. The advantage of putting things in different source files is that you don't have to look at irrelevant items. In particular, if you have routines that are not being used, or are not relevant to the current function you are looking at, they won't interrupt the flow, since you can arrange for them to be in other files.
On the other hand, there may be an advantage in putting everything in one file from the viewpoint of (a) deployment, i.e. you can just send someone that single file; (b) editing convenience, since you can hold the entire program in a single editor session, which facilitates searching (you can search the whole program with the editor's functions without first determining which file a routine is in), successive undo commands let you move backward across all units of your program, and a single save captures the current state of all modules, since there is only one; and (c) speed, i.e. if you are working over a slow network it may be faster to keep a single file on your local machine and just write it out occasionally, rather than having to go back and forth to the slow remote.
Note: One other thing to think about is that using packages may be superior for your needs relative to sourcing files in the first place.
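The unit-testing advantage mentioned under (1) can be sketched as follows (a hypothetical clean-step function; all names are illustrative):

```r
# A clean step written as a function can be tested on a tiny
# hand-made data frame, without running the load step first.
clean_prices <- function(df) {
  df <- df[!is.na(df$price), ]       # drop rows with missing prices
  df$price <- as.numeric(df$price)   # ensure numeric type
  df
}

# A minimal, self-contained test -- no real data required.
toy <- data.frame(price = c("1.5", NA, "2.0"), stringsAsFactors = FALSE)
stopifnot(nrow(clean_prices(toy)) == 2,
          is.numeric(clean_prices(toy)$price))
```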
No one has mentioned an important consideration when writing functions: there's not much point in writing them unless you're repeating some action again and again. In some parts of an analysis, you'll be doing one-off operations, so there's not much point in writing a function for them. If you have to repeat something more than a few times, it's worth investing the time and effort to write a re-usable function.
Workflow:
I use something very similar:
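A layout along these lines, consistent with the description below (the file names are illustrative):

```r
# Run in order to set up the environment; only the later files do analysis.
source("Setup.r")         # packages and global options
source("Load.r")          # read in the raw data
source("Clean.r")         # clean the data
source("Recodes.r")       # recode variables; the environment is saved at the end
source("Functions.r")     # shared helper functions
source("Plot Options.r")  # common plotting defaults
```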
No analysis has been done up to this point. This is just for data cleaning and sorting.
At the end of Recodes.r I save the environment to be reloaded into my actual analysis.
With the cleaning done, and functions and plot options ready, I start getting into my analysis. Again, I continue to break it up into smaller files that are focused on topics or themes, like: demographics, client requests, correlations, correspondence analysis, plots, etc. I almost always run the first 5 automatically to get my environment set up, and then I run the others on a line-by-line basis to ensure accuracy and to explore.
At the beginning of every file I load the cleaned data environment and prosper.
Object Nomenclature:
I don't use lists, but I do use a nomenclature for my objects.
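A hypothetical scheme (the prefixes and the "describe" suffix are illustrative): the prefix encodes the object's role, and the suffix describes its content.

```r
dat.describe <- data.frame(region = c("N", "S"), amount = c(10, 20))  # data set
tab.describe <- table(dat.describe$region)                            # table
mod.describe <- lm(amount ~ region, data = dat.describe)              # model
```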
Using a friendly mnemonic to replace "describe" in the above.
Commenting
I've been trying to get myself into the habit of using comment(x) which I've found incredibly useful. Comments in the code are helpful but oftentimes not enough.
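comment() attaches a note to an object as metadata, without affecting its value; for example:

```r
x <- data.frame(a = 1:3)
# Attach a note that travels with the object (e.g. through save/load):
comment(x) <- "Cleaned 2010-03-01; dropped rows with missing prices"
comment(x)  # retrieve the note later
```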
Cleaning Up
Again, here, I always try to use the same object(s) for easy cleanup: tmp, tmp1, tmp2, tmp3, for example, making sure to remove them at the end.
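For instance, at the end of a block:

```r
tmp  <- data.frame(id = 1:3, price = c(5, -1, 7))
tmp1 <- tmp[tmp$price > 0, ]         # intermediate filtering step
result <- tmp1
rm(tmp, tmp1)                        # remove the scratch objects by name
rm(list = ls(pattern = "^tmp"))      # or sweep up anything matching the pattern
```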
Functions
There has been some commentary in other posts about only writing a function for something if you're going to use it more than once. I'd like to adjust this to say: if you think there's a possibility that you may EVER use it again, you should throw it into a function. I can't even count the number of times I wished I had written a function for a process I created on a line-by-line basis.
Also, BEFORE I change a function, I throw the old version into a file called Deprecated Functions.r, again protecting against the "how the hell did I do that" effect.
I often divide up my code similarly to this (though I usually put Load and Clean in one file), but I never just source all the files to run the entire project; to me that defeats the purpose of dividing them up.
Like the comment from Sharpie says, I think your workflow should depend a lot on the kind of work you're doing. I do mostly exploratory work, and in that context, keeping the data input (load and clean) separate from the analysis (functions and do) means that I don't have to reload and re-clean when I come back the next day; I can instead save the data set after cleaning and then import it again.
I have little experience doing repetitive munging of daily data sets, but I imagine that I would find a different workflow helpful; as Hadley answers, if you're only doing something once (as I am when I load/clean my data), it may not be helpful to write a function. But if you're doing it over and over again (as it seems you would be) it might be much more helpful.
In short, I've found dividing up the code helpful for exploratory analyses, but would probably do something different for repetitive analyses, just like you're thinking about.
I've been pondering workflow tradeoffs for some time.
Here is what I do for any project involving data analysis:
Load and Clean: Create clean versions of the raw data sets for the project, as if I were building a local relational database. Thus, I structure the tables in third normal form where possible. I perform basic munging, but I do not merge or filter tables at this step; again, I'm simply creating a normalized database for a given project. I put this step in its own file, and at the end I save the objects to disk using save.

Functions: I create a function script with functions for data filtering, merging, and aggregation tasks. This is the most intellectually challenging part of the workflow, as I'm forced to think about how to create proper abstractions so that the functions are reusable. The functions need to generalize so that I can flexibly merge and aggregate data from the load-and-clean step. As in the LCFD model, this script has no side effects, since it only loads function definitions.

Function Tests: I create a separate script to test and optimize the performance of the functions defined in step 2. I clearly define what the output from the functions should be, so this step serves as a kind of documentation (think unit testing).

Main: I load the objects saved in step 1. If the tables are too big to fit in RAM, I can filter them with a SQL query, in keeping with the database thinking. I then filter, merge, and aggregate the tables by calling the functions defined in step 2. The tables are passed as arguments to the functions I defined. The output of the functions are data structures in a form suitable for plotting, modeling, and analysis. Obviously, I may have a few extra line-by-line steps where it makes little sense to create a new function.
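A sketch of what that Main step might look like (all names are hypothetical; a stand-in data frame replaces the tables saved in step 1):

```r
# Hypothetical generalized helper, as would live in the functions script:
aggregate_by <- function(df, key, value) {
  aggregate(df[[value]], by = setNames(list(df[[key]]), key), FUN = sum)
}

# In Main: load the saved, normalized tables (here a stand-in),
# then explore with one-liners instead of line-by-line scripting.
trades <- data.frame(ticker = c("A", "A", "B"), volume = c(10, 5, 7))
aggregate_by(trades, "ticker", "volume")
```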
This workflow allows me to do lightning-fast exploration at the Main.R step. This is because I have built clear, generalizable, and optimized functions. The main difference from the LCFD model is that I do not perform line-by-line filtering, merging, or aggregating; I assume that I may want to filter, merge, or aggregate the data in different ways as part of exploration. Additionally, I don't want to pollute my global environment with lengthy line-by-line scripts; as Spacedman points out, functions help with this.