Why is the apply() method slower than a for loop in R?
As a matter of best practices, I'm trying to determine if it's better to create a function and apply() it across a matrix, or if it's better to simply loop a matrix through the function. I tried it both ways and was surprised to find apply() is slower. The task is to take a vector, evaluate each element as either positive or negative, and return a vector with 1 where it's positive and -1 where it's negative. The mash() function loops and the squish() function is passed to the apply() function.
    million <- as.matrix(rnorm(100000))

    # Loop version: overwrite each element in place with 1 or -1
    mash <- function(x){
      for(i in 1:NROW(x))
        if(x[i] > 0) {
          x[i] <- 1
        } else {
          x[i] <- -1
        }
      return(x)
    }

    # Scalar version: classify a single value, to be passed to apply()
    squish <- function(x){
      if(x > 0) {
        return(1)
      } else {
        return(-1)
      }
    }

    ptm <- proc.time()
    loop_million <- mash(million)
    proc.time() - ptm

    ptm <- proc.time()
    apply_million <- apply(million, 1, squish)
    proc.time() - ptm
loop_million results:
       user  system elapsed
      0.468   0.008   0.483
apply_million results:
       user  system elapsed
      1.401   0.021   1.423
What is the advantage to using apply() over a for loop if performance is degraded? Is there a flaw in my test? I compared the two resulting objects for a clue and found:
    > class(apply_million)
    [1] "numeric"
    > class(loop_million)
    [1] "matrix"
Which only deepens the mystery. The apply() function cannot accept a simple numeric vector, and that's why I cast it with as.matrix() in the beginning. But then it returns a numeric. The for loop is fine with a simple numeric vector, and it returns an object of the same class as the one passed to it.
Comments (5)
The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the bookkeeping code needed with loops.
Lately, answers on Stack Overflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.
In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast:
    (million > 0) * 2 - 1
As Chase said: Use the power of vectorization. You're comparing two bad solutions here.
To clarify why your apply solution is slower:
Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no conversion of type going on. I'm glossing over the details here, but basically the internal calculation ignores the dimensions. They're just kept as an attribute and returned with the vector representing the matrix. To illustrate:
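A minimal sketch of this point, not necessarily the illustration the answer originally showed: a matrix is a vector with a dim attribute, and single-index replacement leaves that attribute alone. The names x, y and m are made up for the example.

```r
# A matrix is an atomic vector that carries a "dim" attribute.
x <- 1:10
attr(x, "dim") <- c(5, 2)        # attach dimensions by hand
y <- matrix(1:10, ncol = 2)      # build the same matrix the usual way
all.equal(x, y)                  # the two objects are equivalent
is.matrix(x)                     # x now counts as a matrix

# Single-index replacement treats the matrix as a plain vector,
# so the dim attribute survives and the result is still a matrix.
m <- as.matrix(c(-0.5, 1.2, 3.4))
m[2] <- -1
dim(m)                           # still 3 rows and 1 column
```

This is why mash() hands back an object of the same class it was given.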
Now, when you use apply, the matrix is split up internally into 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the result is combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
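A small sketch of that behaviour on a three-row matrix, assuming a one-line stand-in for squish() (here called sign_one, a name invented for the example): the per-row results come back concatenated into a plain vector, which matches the "numeric" class reported in the question.

```r
m <- as.matrix(c(-0.5, 1.2, 3.4))             # 3 x 1 matrix

# Hypothetical stand-in for squish(): classify one value.
sign_one <- function(v) if (v > 0) 1 else -1

res <- apply(m, 1, sign_one)                  # called once per row
is.matrix(res)                                # FALSE: no longer a matrix
dim(res)                                      # NULL: the dim attribute is gone
```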
Also, the sapply function first uses as.vector(unlist(...)) to convert anything to a vector, and in the end tries to simplify the answer into a suitable form. This also takes time, so sapply might be slower here as well. Yet it isn't on my machine.
If apply were a solution here (and it isn't), you could compare:
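One way such a comparison might look, as a sketch rather than the answer's original code: time apply() over the one-column matrix against sapply() over the plain vector, using the squish() logic from the question. The seed and the variable names are arbitrary, and timings will differ by machine and R version, so none are quoted here.

```r
set.seed(1)                                   # arbitrary seed, for repeatability
x <- rnorm(100000)                            # plain numeric vector
m <- as.matrix(x)                             # 100000 x 1 matrix, as in the question

squish <- function(v) if (v > 0) 1 else -1    # scalar classifier from the question

t_apply  <- system.time(by_apply  <- apply(m, 1, squish))
t_sapply <- system.time(by_sapply <- sapply(x, squish))

print(t_apply)
print(t_sapply)
identical(by_apply, by_sapply)                # both return the same plain vector
```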
You can use lapply or sapply on vectors if you want. However, why not use the appropriate tool for the job, in this case ifelse()?
And for comparison's sake, here are the two comparable runs using the for loop and sapply:
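A sketch of what these runs could look like on a plain numeric vector, assuming the mash() and squish() logic from the question (mash() is written with seq_along() here). It is not the answer's original code, and no timings are quoted because they depend on the machine.

```r
set.seed(1)                                   # arbitrary seed, for repeatability
x <- rnorm(100000)                            # plain vector, no as.matrix() needed

mash <- function(x) {                         # loop version, as in the question
  for (i in seq_along(x)) {
    x[i] <- if (x[i] > 0) 1 else -1
  }
  x
}
squish <- function(v) if (v > 0) 1 else -1    # scalar version, as in the question

print(system.time(by_ifelse <- ifelse(x > 0, 1, -1)))
print(system.time(by_loop   <- mash(x)))
print(system.time(by_sapply <- sapply(x, squish)))

all.equal(by_ifelse, by_loop)                 # all three approaches should agree
all.equal(by_loop, by_sapply)
```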
It is far faster in this case to do index-based replacement than either ifelse(), the *apply() family, or the loop:
It is well worth having all these tools at your fingertips. You can use the one that makes the most sense to you (as you need to understand the code months or years later) and then start to move to more optimised solutions if compute time becomes prohibitive.
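The index-based replacement mentioned above might be sketched as follows. The names pos and out are invented for the example, and the answer's own code may have differed. The logical mask is computed once, before anything is overwritten, so the second assignment cannot be confused by the first.

```r
set.seed(1)                # arbitrary seed, for repeatability
x <- rnorm(100000)

pos <- x > 0               # logical mask of the positive entries
out <- x                   # copy, so the input stays untouched
out[pos]  <- 1             # positives become 1
out[!pos] <- -1            # everything else (including 0) becomes -1

print(system.time({ pos <- x > 0; out <- x; out[pos] <- 1; out[!pos] <- -1 }))
```

Both assignments are single vectorised operations, so no R-level function is called per element.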
Better example for speed advantage of for loop.
EDIT
Version suggested by Eduardo.
By row
By column