What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not?

Posted on 2024-11-17 01:32:56

Possible duplicate:
Is R's apply family more than syntactic sugar

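Just as the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine there is virtually no difference at all between using "for" and "lapply".

Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be any performance benefit from using "lapply" instead of "for", unless you are doing something for which there are compiled functions (like summing or multiplying, etc.)?

Thanks in advance!

Josh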

吾性傲以野 2024-11-24 01:32:56

There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.

Firstly, for(), apply() and sapply() will generally be just as quick as each other if used correctly. lapply() does more of its work in compiled code within the R internals than the others, so it can be faster than those functions. The speed advantage appears to be greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these call R functions, which still have to be interpreted and then run.
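One rough way to check this on your own machine is to time both approaches on a cheap function, where the cost of iterating is a noticeable share of the total. The sketch below is only illustrative: the workload (`square`, `n`) is made up, and the timings will vary with the R version and the hardware, so no particular result should be expected.

```r
# Hypothetical workload: a cheap function applied many times, so the cost of
# iterating is a visible share of the total run time.
n <- 1e5
square <- function(x) x^2

# Interpreted loop with preallocated storage.
time_for <- system.time({
  out_for <- vector("list", n)
  for (i in seq_len(n)) {
    out_for[[i]] <- square(i)
  }
})

# lapply(): the iteration itself is driven from compiled code.
time_lapply <- system.time({
  out_lapply <- lapply(seq_len(n), square)
})

# Both approaches produce the same result; only the timings may differ.
identical(out_for, out_lapply)
c(for_loop = time_for[["elapsed"]], lapply = time_lapply[["elapsed"]])
```

If the function being called is expensive (seconds per call, as in the question), the iteration overhead measured here becomes negligible either way.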

for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:

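IN <- runif(10)
# Preallocate the output before the loop starts
OUT <- logical(length = length(IN))
# Loop over positions, not values, so OUT[i] is a valid index
for (i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}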

That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops, you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, allocate a reasonable chunk of storage, check in the loop whether you have exhausted it, and if so bolt on another big chunk.
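The "allocate a chunk, then bolt on another" idea might look something like the sketch below. The task (collecting random draws until one exceeds 0.99) and the chunk size are assumptions picked purely for illustration; the point is that storage grows occasionally, in blocks, rather than by one element on every iteration.

```r
# Hypothetical task: collect uniform draws until one exceeds 0.99, so the
# number of results is unknown before the loop starts.
set.seed(1)
chunk <- 100L                    # assumed chunk size, chosen arbitrarily
out <- numeric(chunk)            # initial block of storage
n <- 0L                          # number of slots actually used

repeat {
  x <- runif(1)
  if (n == length(out)) {
    # Storage exhausted: bolt on another chunk in one step.
    out <- c(out, numeric(chunk))
  }
  n <- n + 1L
  out[n] <- x
  if (x > 0.99) break
}

out <- out[seq_len(n)]           # drop the unused tail
```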

The main reason, in my mind, for using one of the apply family of functions is more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above), we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and results in simple, easy to understand code. If I always chose the fastest function, I would be far more likely to waste more time than I saved when I can't remember what the code is doing a day or a week or more later!
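To make the comparison concrete, here is a small sketch of the same thresholding task written three ways. It uses vapply() as the apply-family representative, which is a choice made for this illustration rather than something the answer prescribes; sapply() would work similarly without the explicit return-type declaration.

```r
set.seed(42)
IN <- runif(10)

# Explicit loop: preallocate, then fill one element at a time.
OUT_for <- logical(length(IN))
for (i in seq_along(IN)) {
  OUT_for[i] <- IN[i] > 0.5
}

# vapply() allocates and assembles the result itself; the third argument
# declares that each call returns a single logical value.
OUT_vapply <- vapply(IN, function(x) x > 0.5, logical(1))

# For this particular operation the vectorised form is simpler still.
OUT_vec <- IN > 0.5

identical(OUT_for, OUT_vapply) && identical(OUT_for, OUT_vec)
```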

The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
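As a rough illustration of that kind of loop, the sketch below runs a k-fold cross-validation in which each iteration does several things and fills several output objects. The data set (mtcars), the model (mpg ~ wt) and k = 5 are assumptions chosen for the example, not the code referred to above.

```r
# Sketch of k-fold cross-validation with a for() loop. The data set, the
# model formula and the number of folds are illustrative assumptions.
set.seed(123)
k <- 5
n <- nrow(mtcars)
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

# Several output objects, all preallocated and filled via the same index i.
rmse  <- numeric(k)
slope <- numeric(k)
pred  <- numeric(n)

for (i in seq_len(k)) {
  test <- fold == i
  fit  <- lm(mpg ~ wt, data = mtcars[!test, ])     # fit on the other folds
  p    <- predict(fit, newdata = mtcars[test, ])   # predict the held-out fold
  pred[test] <- p
  rmse[i]    <- sqrt(mean((mtcars$mpg[test] - p)^2))
  slope[i]   <- coef(fit)[["wt"]]
}

mean(rmse)  # cross-validated error estimate
```

The same computation could be squeezed into lapply() by returning a list from each fold and unpacking it afterwards, but the loop keeps each step and each output visible.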

As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if the looping and calling is done directly from compiled C code (e.g. lapply()), that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:

> lapply
function (X, FUN, ...) 
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) 
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>

and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a manipulation of X and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
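If you would rather not read through the full source of apply(), one quick way to make the comparison is to search the deparsed function bodies for an R-level for() loop and for a call to .Internal(). This is only a text search, and the helper names (`has_for`, `has_internal`) are made up for the sketch; the exact source may differ between R versions.

```r
# deparse() turns a function into its source text, one string per line.
apply_src  <- deparse(apply)
lapply_src <- deparse(lapply)

# Does the function body contain an R-level for() loop?
has_for <- function(src) any(grepl("for (", src, fixed = TRUE))
has_for(apply_src)
has_for(lapply_src)

# Does the function body hand off to compiled code via .Internal()?
has_internal <- function(src) any(grepl(".Internal(", src, fixed = TRUE))
has_internal(apply_src)
has_internal(lapply_src)
```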
