Is R's apply family more than syntactic sugar?
...regarding execution time and/or memory.

If this is not true, prove it with a code snippet. Note that speedup from vectorization does not count. The speedup must come from `apply` (`tapply`, `sapply`, ...) itself.
The `apply` functions in R don't provide improved performance over other looping functions (e.g. `for`). One exception to this is `lapply`, which can be a little faster because it does more work in C code than in R (see this question for an example of this).

But in general, the rule is that you should use an apply function for clarity, not for performance.
I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using `assign` or `<<-`, but that can be very dangerous. Side effects also make a program harder to understand, since a variable's state depends on the history.

Edit:

Just to emphasize this, take a trivial example that recursively calculates the Fibonacci sequence. It could be run multiple times to get an accurate measure, but the point is that none of the looping methods has significantly different performance.
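The original snippet is not preserved on this page, so what follows is only a minimal sketch of that kind of comparison. The function name `fibo` and the input range `0:20` are assumptions chosen for illustration. Because the recursive function dominates the run time, the three looping constructs should end up close to each other.

```r
# Naive recursive Fibonacci: deliberately slow, so the function body
# (not the looping construct) accounts for nearly all of the run time.
fibo <- function(n) {
  if (n < 2) n else fibo(n - 1) + fibo(n - 2)
}

n <- 0:20  # hypothetical input range, kept small so the sketch runs quickly

# 1. for loop writing into a preallocated result vector
res_for <- numeric(length(n))
t_for <- system.time(for (i in seq_along(n)) res_for[i] <- fibo(n[i]))

# 2. sapply
t_sapply <- system.time(res_sapply <- sapply(n, fibo))

# 3. lapply (returns a list, so flatten it for comparison)
t_lapply <- system.time(res_lapply <- unlist(lapply(n, fibo)))

# Elapsed seconds per method; the exact values depend on the machine
print(c(`for`  = t_for[["elapsed"]],
        sapply = t_sapply[["elapsed"]],
        lapply = t_lapply[["elapsed"]]))
```

For a stable measurement, repeat each timing several times and average, as the answer suggests.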
Edit 2:

Regarding the usage of parallel packages for R (e.g. rpvm, Rmpi, snow), these do generally provide `apply` family functions (even the `foreach` package is essentially equivalent, despite the name). `snow`, for example, has its own version of `sapply`. Run on a socket cluster, it needs no additional software to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page).
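As a rough sketch of a parallel `sapply` on a socket cluster, the snippet below uses the `parallel` package that ships with R rather than `snow` itself. That substitution is an assumption made so the example runs without installing anything; `parallel` offers `makeCluster`, `parSapply` and `stopCluster` with a snow-style interface. The worker count of 2 is arbitrary.

```r
library(parallel)  # bundled with R; used here in place of snow (assumption)

# Start two local worker processes connected over sockets (no PVM/MPI needed)
cl <- makeCluster(2)

# Each element of 1:10 is sent to a worker; results come back as a vector
squares <- parSapply(cl, 1:10, function(x) x^2)

# Always release the workers when done
stopCluster(cl)

print(squares)
```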
`snow` provides its own set of apply functions.

It makes sense that `apply` functions should be used for parallel execution, since they have no side effects. When you change a variable value within a `for` loop, the change persists in the environment the loop runs in (the global environment, at top level). On the other hand, all `apply` functions can safely be used in parallel because changes are local to the function call (unless you try to use `assign` or `<<-`, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.

Edit:
A trivial example demonstrates the difference between `for` and `*apply` so far as side effects are concerned: the `df` in the parent environment is altered by `for` but not by `*apply`.
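The side-effect difference described above can be sketched as follows. The variable name `df` comes from the answer; its contents and the multiplication are assumptions made for illustration.

```r
df <- 1:10  # a variable in the calling (here: global) environment

# *apply: the assignment happens inside the anonymous function's own
# environment, so the outer df is left untouched
invisible(lapply(1:2, function(i) df <- df * i))
after_lapply <- df

# for: the loop body runs in the calling environment, so df is overwritten
for (i in 1:2) df <- df * i
after_for <- df

print(after_lapply)  # still 1 2 ... 10
print(after_for)     # now 2 4 ... 20
```

Replacing `<-` with `<<-` inside the anonymous function would make the `lapply` version modify the outer `df` as well, which is the dangerous case the answer warns about.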
Sometimes the speedup can be substantial, like when you have to nest `for` loops to get the average based on a grouping of more than one factor. Two approaches, a nested `for` loop and a single `tapply` call, give you the exact same result.
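The answer's own code is not preserved here. The sketch below shows what such a comparison could look like. The data, the seed and the helper name `group_means_loop` are all hypothetical.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data: one numeric vector and two grouping factors
x <- rnorm(1e5)
y <- factor(sample(letters[1:5],  1e5, replace = TRUE))
z <- factor(sample(letters[1:10], 1e5, replace = TRUE))

# Approach 1: nested for loops over every combination of levels
group_means_loop <- function(x, y, z) {
  ylev <- levels(y)
  zlev <- levels(z)
  out <- matrix(NA_real_, nrow = length(ylev), ncol = length(zlev),
                dimnames = list(ylev, zlev))
  for (i in seq_along(ylev)) {
    for (j in seq_along(zlev)) {
      out[i, j] <- mean(x[y == ylev[i] & z == zlev[j]])
    }
  }
  out
}
t_loop <- system.time(m_loop <- group_means_loop(x, y, z))

# Approach 2: a single tapply call over both factors
t_tapply <- system.time(m_tapply <- tapply(x, list(y, z), mean))

print(dim(m_loop))
print(all.equal(m_loop, m_tapply, check.attributes = FALSE))
print(c(loop = t_loop[["elapsed"]], tapply = t_tapply[["elapsed"]]))
```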
Both give exactly the same result, being a 5 x 10 matrix with the averages and named rows and columns. But the `tapply` version is substantially faster.

There you go. What did I win? ;-)
...and as I just wrote elsewhere, `vapply` is your friend!

...it's like `sapply`, but you also specify the return value type, which lets it skip `sapply`'s after-the-fact result simplification and can make it faster.
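To illustrate what "specify the return value type" means, here is a small sketch. The input and the anonymous functions are made up for the example. It only shows behaviour, not timings.

```r
x <- as.list(1:5)  # hypothetical input

# sapply inspects the results afterwards to decide what shape to return
s <- sapply(x, function(i) i * 2)

# vapply is told up front that every result is a single numeric value
v <- vapply(x, function(i) i * 2, FUN.VALUE = numeric(1))
print(identical(s, v))

# With empty input, sapply falls back to a list; vapply keeps the declared type
empty_s <- sapply(list(), function(i) i * 2)
empty_v <- vapply(list(), function(i) i * 2, FUN.VALUE = numeric(1))
print(class(empty_s))
print(class(empty_v))

# A result of the wrong type is an error in vapply, not a silent type change
bad <- tryCatch(
  vapply(x, function(i) "a", FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
print(bad)
```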
Jan. 1, 2020 update:
I've written elsewhere that an example like Shane's doesn't really stress the difference in performance among the various kinds of looping syntax, because the time is all spent within the function rather than in the loop itself. Furthermore, the code unfairly compares a `for` loop that stores no results with apply family functions that return a value. A slightly different example, one where the function does almost nothing and every result is kept, emphasizes the point.
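The code that belonged here is missing from this copy. Below is a sketch of that kind of benchmark, with a hypothetical trivial function `foo` and a problem size chosen so it runs quickly. Timings vary by machine and R version; in particular, since R 3.4.0 loops are byte-compiled by default, so the `for` loop may well do better than the numbers quoted in this answer suggest.

```r
foo <- function(x) x + 1  # trivial function, so loop overhead dominates
n   <- 1e5                # hypothetical size, small enough to run quickly
idx <- seq_len(n)

# for loop that stores every result in a preallocated vector
t_for <- system.time({
  z_for <- numeric(n)
  for (i in idx) z_for[i] <- foo(i)
})

# sapply and lapply return their results directly
t_sapply <- system.time(z_sapply <- sapply(idx, foo))
t_lapply <- system.time(z_lapply <- lapply(idx, foo))

print(c(`for`  = t_for[["elapsed"]],
        sapply = t_sapply[["elapsed"]],
        lapply = t_lapply[["elapsed"]]))
```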
If you plan to save the result, then apply family functions can be much more than syntactic sugar.

(The simple `unlist` of `z` takes only 0.2 s, so `lapply` is much faster. Initializing `z` in the `for` loop is quite fast; since I'm giving the average of the last 5 of 6 runs, moving that outside `system.time` would hardly affect things.)
One more thing to note, though, is that there is another reason to use apply family functions, independent of their performance, clarity, or lack of side effects. A `for` loop typically promotes putting as much as possible within the loop. This is because each loop requires setup of variables to store information (among other possible operations). Apply statements tend to be biased the other way. Often you want to perform multiple operations on your data, several of which can be vectorized but some of which cannot. In R, unlike other languages, it is best to separate those operations out, run the ones that are not vectorized in an apply statement (or a vectorized version of the function), and run the ones that are vectorized as true vector operations. This often speeds up performance tremendously.

Taking Joris Meys' example, where he replaces a traditional `for` loop with a handy R function, we can use it to show the efficiency of writing code in a more R-friendly manner, for a similar speedup without the specialized function.
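One way to write the grouped means in that style is sketched below. The data and seed are hypothetical stand-ins for Joris Meys' setup. Only the `vapply` call loops over groups; building the combined factor, splitting and reshaping are all vectorized.

```r
set.seed(1)  # arbitrary seed for reproducibility of this sketch

# Hypothetical data: one numeric vector and two grouping factors
x <- rnorm(1e5)
y <- factor(sample(letters[1:5],  1e5, replace = TRUE))
z <- factor(sample(letters[1:10], 1e5, replace = TRUE))

yz <- interaction(y, z)              # one combined factor (vectorized)
xs <- split(x, yz)                   # list of x values per combination
m  <- vapply(xs, mean, numeric(1))   # the only per-group operation

# interaction() varies the first factor fastest, which matches R's
# column-major matrix filling: rows are levels of y, columns levels of z
m <- matrix(m, nrow = nlevels(y), dimnames = list(levels(y), levels(z)))

print(all.equal(m, tapply(x, list(y, z), mean), check.attributes = FALSE))
```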
This winds up being much faster than the `for` loop and just a little slower than the built-in optimized `tapply` function. It's not because `vapply` is so much faster than `for`, but because it is only performing one operation in each iteration of the loop. In this code everything else is vectorized. In Joris Meys' traditional `for` loop many (7?) operations occur in each iteration, and there's quite a bit of setup just for it to execute. Note also how much more compact this is than the `for` version.

When applying functions over subsets of a vector, `tapply` can be considerably faster than a `for` loop.

`apply`, however, in most situations doesn't provide any speed increase, and in some cases can even be a lot slower. But for these situations we've got `colSums` and `rowSums`.
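The two comparisons in this last answer can be sketched roughly as follows. The data sizes, seed and variable names are assumptions made for illustration, and the measured times will depend on the machine and R version.

```r
set.seed(1)  # arbitrary seed for reproducibility of this sketch

# Part 1: sums over subsets of a vector, for loop vs tapply
values <- rnorm(1e5)
groups <- sample(1:100, 1e5, replace = TRUE)

t_for <- system.time({
  ids <- sort(unique(groups))
  sums_for <- numeric(length(ids))
  # each iteration scans the whole vector to pick out one group
  for (k in seq_along(ids)) sums_for[k] <- sum(values[groups == ids[k]])
})
t_tapply <- system.time(sums_tapply <- tapply(values, groups, sum))

# Part 2: column sums of a matrix, apply vs the dedicated colSums
mat <- matrix(rnorm(1e6), ncol = 100)
t_apply   <- system.time(cs_apply   <- apply(mat, 2, sum))
t_colSums <- system.time(cs_colSums <- colSums(mat))

print(c(`for`   = t_for[["elapsed"]],
        tapply  = t_tapply[["elapsed"]],
        apply   = t_apply[["elapsed"]],
        colSums = t_colSums[["elapsed"]]))
```

`rowSums` is the analogous replacement for `apply(mat, 1, sum)`.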