Coding practice in R: what are the advantages and disadvantages of different styles?
The recent questions regarding the use of require versus :: raised the question about which programming styles are used when programming in R, and what their advantages/disadvantages are. Browsing through the source code or browsing on the net, you see a lot of different styles displayed.
The main trends in my code:
Heavy vectorization. I play a lot with indices (and nested indices), which sometimes results in rather obscure code but is generally a lot faster than other solutions, e.g.:
x[x < 5] <- 0
instead of
x <- ifelse(x < 5, 0, x)
I tend to nest functions to avoid overloading memory with temporary objects that I then need to clean up. Especially with functions manipulating large datasets, this can be a real burden, e.g.:
y <- cbind(x, as.numeric(factor(x)))
instead of
y <- as.numeric(factor(x)); z <- cbind(x, y)
I write a lot of custom functions, even if I use the code only once, e.g. in an sapply. I believe this keeps the code more readable without creating objects that remain lying around.
I avoid loops at all costs, as I consider vectorization to be a lot cleaner (and faster).
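As a rough illustration of the habits listed above, here is a minimal sketch on made-up data (all variable names are invented for the example). It compares index-based replacement with the ifelse() call that gives the same result, namely ifelse(x < 5, 0, x), and a one-off anonymous function inside sapply with the loop it stands in for:

```r
# Toy data, invented for this example
x <- c(2, 7, 4, 9, 5)

# Index-based replacement: overwrite only the selected elements
x_index <- x
x_index[x_index < 5] <- 0

# ifelse() builds a complete new vector from the test
x_ifelse <- ifelse(x < 5, 0, x)

# Both approaches give the same vector
same_result <- identical(x_index, x_ifelse)

# One-off anonymous function inside sapply(): nothing is left behind
groups <- list(a = 1:3, b = 4:6)
range_width <- sapply(groups, function(v) max(v) - min(v))

# The same computation as an explicit, preallocated loop
range_loop <- numeric(length(groups))
names(range_loop) <- names(groups)
for (i in seq_along(groups)) {
  range_loop[i] <- max(groups[[i]]) - min(groups[[i]])
}
```

Which version reads better is exactly the matter of taste this question is about; the sketch only shows that the variants are interchangeable in their results.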
Yet I've noticed that opinions on this differ, and some people tend to back away from what they would call my "Perl" way of programming (or even "Lisp", with all those brackets flying around in my code, though I wouldn't go that far).
What do you consider good coding practice in R?
What is your programming style, and how do you see its advantages and disadvantages?
Comments (4)
What I do depends on why I am writing the code. If I am writing a data analysis script for my research (day job), I want something that works but that is also readable and understandable months or even years later. I don't care too much about compute times. Vectorizing with lapply et al. can lead to obfuscation, which I would like to avoid. In such cases, I would use loops for a repetitive process if, for example, lapply made me jump through hoops to construct the appropriate anonymous function. I would use the ifelse() in your first bullet because, to my mind at least, the intention of that call is easier to comprehend than the subset+replacement version. With my data analysis I am more concerned with getting things correct than with compute time; there are always the weekends and nights when I'm not in the office and can run big jobs.
For your other bullets: I would tend not to inline/nest calls unless they were very trivial. If I spell out the steps explicitly, I find the code easier to read and therefore less likely to contain bugs.
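To make that trade-off concrete, here is a small sketch on invented data (none of these names come from the question). It shows the same computation once as nested calls and once spelled out step by step, and then an sapply call with an anonymous function next to the plain loop that does the same job:

```r
# Invented example data
x <- c("low", "high", "mid", "high")

# Nested: compact, no intermediate objects
nested <- cbind(seq_along(x), as.numeric(factor(x)))

# Spelled out: each intermediate result can be inspected on its own
x_factor <- factor(x)              # levels are sorted alphabetically
x_codes  <- as.numeric(x_factor)   # integer codes of those levels
stepwise <- cbind(seq_along(x), x_codes)

# sapply with an anonymous function that has to look things up by name
samples <- list(a = c(1, 4, 9), b = c(2, 3), c = c(10, 20, 30, 40))
weights <- c(a = 0.5, b = 2, c = 1)
via_sapply <- sapply(names(samples),
                     function(nm) weights[[nm]] * mean(samples[[nm]]))

# The same thing as an explicit loop over the names
via_loop <- numeric(length(samples))
names(via_loop) <- names(samples)
for (nm in names(samples)) {
  via_loop[nm] <- weights[[nm]] * mean(samples[[nm]])
}
```

The stepwise version leaves x_factor and x_codes in the workspace, which is the cost the question mentions; in exchange, each step can be checked separately.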
I write custom functions all the time, especially if I am going to be calling the code equivalent of the function repeatedly in a loop or similar. That way I have encapsulated the code out of the main data analysis script into its own .R file, which helps keep the intention of the analysis separate from how the analysis is done. And if the function is useful, I have it for use in other projects etc.
If I am writing code for a package, I might start with the same attitude as for my data analysis (familiarity) to get something I know works, and only then go for optimisation if I want to improve compute times.
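A minimal sketch of keeping a function in its own .R file, as described above. The file name standardise.R, the function, and the data are hypothetical; the file is written to a temporary directory only so that the example runs on its own:

```r
# Hypothetical helper file; written to tempdir() so the sketch is self-contained
helper_path <- file.path(tempdir(), "standardise.R")
writeLines(c(
  "# Centre and scale a numeric vector",
  "standardise <- function(v) {",
  "  (v - mean(v)) / sd(v)",
  "}"
), helper_path)

# Main analysis script: load the helper, then say what is done, not how
source(helper_path)
heights <- c(150, 160, 170, 180, 190)   # invented measurements
heights_z <- standardise(heights)
```

In a real project the helper would simply live next to the analysis script and be loaded with source(), so it can be reused in other projects as well.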
The one thing I try to avoid, whatever I am coding for, is being too clever when I code. Ultimately I am never as clever as I sometimes think I am, and if I keep things simple, I tend not to fall on my face as often as I might if I were trying to be clever.
I write functions (in standalone .R files) for various chunks of code that conceptually do one thing. This keeps things short and sweet. I find debugging somewhat easier, because traceback() tells you which function produced an error.
I too tend to avoid loops, except when it's absolutely necessary. I feel somewhat dirty if I use a for() loop. :) I try really hard to do everything vectorized or with the apply family. This is not always the best practice, especially if you need to explain the code to another person who is not as fluent in apply or vectorization.
Regarding the use of require vs ::, I tend to use both. If I only need one function from a certain package I use it via ::, but if I need several functions, I load the entire package. If there's a conflict in function names between packages, I try to remember and use ::.
I try to find a function for every task I'm trying to achieve. I believe someone before me has thought of it and made a function that works better than anything I can come up with. This sometimes works, sometimes not so much.
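The require vs :: habit described above can be sketched as follows. The packages stats and tools ship with R and merely stand in for any package here; the data and the masking function are invented for the example:

```r
x <- c(1, 2, 3, 4, 100)   # invented data

# One function from a package: call it through the namespace, nothing is attached
med <- stats::median(x)

# Several functions from the same package: attach the package once
library(tools)
ext  <- file_ext("analysis.R")
base <- file_path_sans_ext("analysis.R")

# A name clash: this local definition masks stats::sd ...
sd <- function(v) "my own sd"
masked <- sd(x)

# ... while :: still reaches the package version explicitly
explicit <- stats::sd(x)
```

Writing pkg::fun also documents where a function comes from, which can help a later reader even when there is no clash.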
I try to write my code so that I can understand it. This means I comment a lot and construct chunks of code so that they somehow follow the idea of what I'm trying to achieve. I often overwrite objects as the function progresses. I think this keeps the transparency of the task, especially if you're referring to these objects later in the function. I think about speed when computing time exceeds my patience. If a function takes so long to finish that I start browsing SO, I see if I can improve it.
I found out that a good syntax editor with code folding and syntax coloring (I use Eclipse + StatET) has saved me a lot of headaches.
Based on VitoshKa's post, I am adding that I use capitalizedWords (sensu Java) for function names and fullstop.delimited for variables. I see that I could have another style for function arguments.
Naming conventions are extremely important for the readability of code. Inspired by R's S4 internal style, here is what I use:
For data juggling I try to use as much SQL as possible, at least for basic things like GROUP BY averages. I like R a lot, but sometimes it is not much fun to realize that your research strategy was not good enough to find yet another function hidden in yet another package. For my cases SQL dialects do not differ much, and the code is really transparent. Most of the time the threshold (when to start using R syntax) is rather intuitive to discover. e.g.
So I consider it good practice and really recommend using a SQL database for your data in most use cases. I am also looking into TSdbi and saving time series in a relational database, but cannot really judge that yet.