Why is the apply() method slower than a for loop in R?

Posted on 2024-10-30 01:15:25

As a matter of best practices, I'm trying to determine if it's better to create a function and apply() it across a matrix, or if it's better to simply loop a matrix through the function. I tried it both ways and was surprised to find apply() is slower. The task is to take a vector and evaluate it as either being positive or negative and then return a vector with 1 if it's positive and -1 if it's negative. The mash() function loops and the squish() function is passed to the apply() function.

million  <- as.matrix(rnorm(100000))

mash <- function(x){
  # Loop over every element and overwrite it with 1 or -1
  for(i in seq_len(NROW(x))) {
    if(x[i] > 0) {
      x[i] <- 1
    } else {
      x[i] <- -1
    }
  }
  return(x)
}

squish <- function(x){
  if(x >0) {
    return(1)
  } else {
    return(-1)
  }
}


ptm <- proc.time()
loop_million <- mash(million)
proc.time() - ptm


ptm <- proc.time()
apply_million <- apply(million,1, squish)
proc.time() - ptm

loop_million results:

user  system elapsed 
0.468   0.008   0.483 

apply_million results:

user  system elapsed 
1.401   0.021   1.423 

What is the advantage of using apply() over a for loop if performance is degraded? Is there a flaw in my test? I compared the two resulting objects for a clue and found:

> class(apply_million)
[1] "numeric"
> class(loop_million)
[1] "matrix"

This only deepens the mystery. The apply() function cannot accept a simple numeric vector, which is why I cast it with as.matrix() at the beginning. But then it returns a numeric. The for loop is fine with a simple numeric vector, and it returns an object of the same class as the one passed to it.

Comments (5)

刘备忘录 2024-11-06 01:15:25

The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the bookkeeping code needed with loops.

Lately, answers on Stack Overflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.

In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast, (million > 0) * 2 - 1.
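
As a small illustration of the vectorised expression above, here is a sketch on a made-up four-element matrix (the input values are assumptions chosen for the example, not data from the question). The comparison yields a logical matrix, and arithmetic on it coerces TRUE and FALSE to 1 and 0.

```r
# Illustrative input; the values are made up for this sketch.
x <- as.matrix(c(-1.5, 0.2, 3, -0.1))

# x > 0 is a logical matrix. Multiplying by 2 coerces TRUE/FALSE to 1/0,
# so positives become 1 * 2 - 1 = 1 and everything else 0 * 2 - 1 = -1.
signs <- (x > 0) * 2 - 1

print(signs)
```

Because the arithmetic works on the whole object at once, the result keeps the matrix dimensions of the input.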

药祭#氼 2024-11-06 01:15:25

As Chase said: Use the power of vectorization. You're comparing two bad solutions here.

To clarify why your apply solution is slower:

Within the for loop, you index the matrix with a single subscript, which addresses the underlying vector directly, so no type conversion takes place. I'm glossing over the details here, but basically the internal calculation ignores the dimensions. They are just kept as an attribute and returned with the vector representing the matrix. To illustrate:

> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE
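
To connect this with the loop in the question, the following sketch (using a small made-up matrix) shows that a single subscript such as x[i] reads and writes the underlying column-major vector, while the dim attribute is left untouched.

```r
# A 5 x 2 matrix stored column by column: column 1 is 1..5, column 2 is 6..10.
m <- matrix(1:10, ncol = 2)

# A single subscript addresses the underlying vector directly,
# so element 7 is the same cell as row 2, column 2.
single <- m[7]
double <- m[2, 2]

# Assigning through a single subscript does not disturb the dimensions.
m[7] <- 0L
print(dim(m))
```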

Now, when you use apply(), the matrix is split up internally into 100,000 row vectors, every row vector (here, a single number) is put through the function, and in the end the results are combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
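
The split-call-combine behaviour described above can be imitated with a rough sketch. The helper name apply_rows_sketch is hypothetical and this is not how apply() is actually implemented; it only mirrors the three steps for the simple case where the function returns one value per row.

```r
# Hypothetical helper, for illustration only: not the real apply() source.
apply_rows_sketch <- function(m, f) {
  n <- nrow(m)
  results <- vector("list", n)      # one slot per row
  for (i in seq_len(n)) {
    results[[i]] <- f(m[i, ])       # extract row i as a plain vector, call f
  }
  unlist(results)                   # combine the per-row results into one vector
}

squish <- function(x) if (x > 0) 1 else -1

m <- as.matrix(c(-2, 5, 0.5))       # made-up one-column matrix
out <- apply_rows_sketch(m, squish)
print(out)
```

The per-row extraction and the final combining step are work that the plain for loop in the question never has to do.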

The sapply function also first uses as.vector(unlist(...)) to convert its input to a vector, and in the end tries to simplify the answer into a suitable form. This takes time too, so sapply might also be slower here. Yet on my machine it is not.

If an apply-family function were the solution here (and it isn't), you could compare:

> system.time(loop_million <- mash(million))
   user  system elapsed 
   0.75    0.00    0.75    
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=FALSE))))
   user  system elapsed 
   0.25    0.00    0.25 
> system.time(sapply2_million <- matrix(sapply(million,squish)))
   user  system elapsed 
   0.34    0.00    0.34 
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
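
The matrix() wrapper in the calls above is there because sapply() hands back a plain vector. A short sketch on made-up data shows the dimensions being dropped and then restored.

```r
squish <- function(x) if (x > 0) 1 else -1

m <- as.matrix(c(-0.3, 1.2, 4))   # made-up one-column matrix

flat <- sapply(m, squish)         # iterates over the elements, returns a vector
restored <- matrix(flat)          # wrap again to get a one-column matrix

print(dim(flat))                  # NULL: the dim attribute is gone
print(dim(restored))
```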

篱下浅笙歌 2024-11-06 01:15:25

You can use lapply or sapply on vectors if you want. However, why not use the appropriate tool for the job, in this case ifelse()?

> ptm <- proc.time()
> ifelse_million <- ifelse(million > 0,1,-1)
> proc.time() - ptm
   user  system elapsed 
  0.077   0.007   0.093 

> all.equal(ifelse_million, loop_million)
[1] TRUE

And for comparison's sake, here are the two comparable runs using the for loop and sapply:

> ptm <- proc.time()
> apply_million <- sapply(million, squish)
> proc.time() - ptm
   user  system elapsed 
  0.469   0.004   0.474 
> ptm <- proc.time()
> loop_million <- mash(million)
> proc.time() - ptm
   user  system elapsed 
  0.408   0.001   0.417 
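
As a small sketch of why ifelse() compares equal to the loop result, the example below (with made-up values) shows that ifelse() returns an object with the same shape as its test argument, so a one-column matrix goes in and a one-column matrix comes out.

```r
m <- as.matrix(c(-0.3, 1.2, 0, 4))   # made-up input, including a zero

# The test m > 0 is a logical matrix; ifelse() fills in 1 or -1 element-wise
# and keeps the shape of the test.
res <- ifelse(m > 0, 1, -1)

print(res)
```

Note that zero is mapped to -1 here, just as in the loop and in squish(), because the condition is a strict x > 0.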

挥剑断情 2024-11-06 01:15:25

It is far faster in this case to do index-based replacement than to use ifelse(), the *apply() family, or the loop:

> million  <- million2 <- as.matrix(rnorm(100000))
> system.time(million3 <- ifelse(million > 0, 1, -1))
   user  system elapsed 
  0.046   0.000   0.044 
> system.time({million2[(want <- million2 > 0)] <- 1; million2[!want] <- -1}) 
   user  system elapsed 
  0.006   0.000   0.007 
> all.equal(million2, million3)
[1] TRUE

It is well worth having all these tools at your fingertips. You can use the one that makes the most sense to you (as you need to understand the code months or years later) and then start to move to more optimised solutions if compute time becomes prohibitive.
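
The index-based replacement from the timing above can be unpacked step by step. This is a sketch on made-up values; the variable name want follows the answer.

```r
x <- as.matrix(c(-0.3, 1.2, 0, 4))   # made-up input

want <- x > 0      # logical mask, computed once and reused
x[want]  <- 1      # overwrite the positive entries
x[!want] <- -1     # overwrite everything else

print(x)
```

Both assignments operate on whole subsets at once, so no R-level iteration over individual elements is needed.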

淡淡の花香 2024-11-06 01:15:25

Here is a better example of the speed advantage of the for loop.

for_loop <- function(x){
    # Preallocate the result, then fill in the maximum of each row
    out <- vector(mode="numeric",length=NROW(x))
    for(i in seq_along(out))
        out[i] <- max(x[i,])
    return(out)
}

apply_loop <- function(x){
    apply(x,1,max)
}

million  <- matrix(rnorm(1000000),ncol=10)
> system.time(apply_loop(million))
  user  system elapsed 
  0.57    0.00    0.56 
> system.time(for_loop(million))
  user  system elapsed 
  0.32    0.00    0.33 

EDIT

Version suggested by Eduardo.

max_col <- function(x){
    x[cbind(seq(NROW(x)),max.col(x))]
}

By row

> system.time(for_loop(million))
   user  system elapsed 
   0.99    0.00    1.11 
> system.time(apply_loop(million))
  user  system elapsed 
   1.40    0.00    1.44 
> system.time(max_col(million))
  user  system elapsed 
  0.06    0.00    0.06 

By column

> system.time(for_loop(t(million)))
  user  system elapsed 
  0.05    0.00    0.05 
> system.time(apply_loop(t(million)))
  user  system elapsed 
  0.07    0.00    0.07 
> system.time(max_col(t(million)))
  user  system elapsed 
  0.04    0.00    0.06 
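
To make the max_col() trick easier to follow, here is a sketch on a small made-up matrix. max.col() returns, for each row, the column position of the maximum, and indexing with a two-column matrix of (row, column) pairs then picks out those cells in one step.

```r
# Made-up 2 x 3 matrix, filled row by row.
m <- matrix(c(1, 9, 3,
              7, 2, 8), nrow = 2, byrow = TRUE)

# One (row, column) pair per row, pointing at that row's maximum.
idx <- cbind(seq_len(nrow(m)), max.col(m))

# Indexing with a two-column matrix extracts exactly those cells.
row_max <- m[idx]

print(row_max)
```

One caveat, as far as I recall from the max.col() documentation: ties are broken at random by default and near-equal entries may be treated as ties, so for rows with extremely close values the result could differ marginally from max(). It is worth checking ?max.col before relying on this.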