Is R's apply family more than syntactic sugar?

Published 2024-08-21 23:26:53

...regarding execution time and / or memory.

If this is not true, prove it with a code snippet. Note that speedup by vectorization does not count. The speedup must come from apply (tapply, sapply, ...) itself.


〃温暖了心ぐ 2024-08-28 23:26:53

The apply functions in R don't provide improved performance over other looping functions (e.g. for). One exception to this is lapply which can be a little faster because it does more work in C code than in R (see this question for an example of this).

But in general, the rule is that you should use an apply function for clarity, not for performance.

I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using assign or <<-, but that can be very dangerous. Side effects also make a program harder to understand, since a variable's state depends on its history.
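A minimal sketch of that distinction (the `counter` name is just for illustration): plain `<-` inside the function stays local, while `<<-` reaches into the enclosing environment and introduces a side effect:

```r
counter <- 0
# <- assigns locally inside the anonymous function; the outer variable is untouched
invisible(sapply(1:5, function(i) counter <- counter + 1))
counter
# [1] 0
# <<- performs superassignment into the enclosing environment: a side effect
invisible(sapply(1:5, function(i) counter <<- counter + 1))
counter
# [1] 5
```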

Edit:

Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence; this could be run multiple times to get an accurate measure, but the point is that none of the methods have significantly different performance:

fibo <- function(n) {
  if ( n < 2 ) n
  else fibo(n-1) + fibo(n-2)
}
system.time(for(i in 0:26) fibo(i))
# user  system elapsed 
# 7.48    0.00    7.52 
system.time(sapply(0:26, fibo))
# user  system elapsed 
# 7.50    0.00    7.54 
system.time(lapply(0:26, fibo))
# user  system elapsed 
# 7.48    0.04    7.54 
library(plyr)
system.time(ldply(0:26, fibo))
# user  system elapsed 
# 7.52    0.00    7.58 

Edit 2:

Regarding the usage of parallel packages for R (e.g. rpvm, rmpi, snow), these do generally provide apply family functions (even the foreach package is essentially equivalent, despite the name). Here's a simple example of the sapply function in snow:

library(snow)
cl <- makeSOCKcluster(c("localhost","localhost"))
parSapply(cl, 1:20, get("+"), 3)

This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page). snow has the following apply functions:

parLapply(cl, x, fun, ...)
parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
parApply(cl, X, MARGIN, FUN, ...)
parRapply(cl, x, fun, ...)
parCapply(cl, x, fun, ...)

It makes sense that apply functions should be used for parallel execution since they have no side effects. When you change a variable value within a for loop, it is globally set. On the other hand, all apply functions can safely be used in parallel because changes are local to the function call (unless you try to use assign or <<-, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.
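As a side note, since R 2.14 the base parallel package ships the snow interface, so the same pattern works without installing anything extra (a sketch; the cluster size of 2 is arbitrary):

```r
library(parallel)

# socket (PSOCK) cluster, the analogue of snow's makeSOCKcluster
cl <- makeCluster(2)
res <- parSapply(cl, 1:20, get("+"), 3)
stopCluster(cl)

all(res == 4:23)
# [1] TRUE
```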

Edit:

Here's a trivial example to demonstrate the difference between for and *apply so far as side effects are concerned:

df <- 1:10
# *apply example
lapply(2:3, function(i) df <- df * i)
df
# [1]  1  2  3  4  5  6  7  8  9 10
# for loop example
for(i in 2:3) df <- df * i
df
# [1]  6 12 18 24 30 36 42 48 54 60

Note how the df in the parent environment is altered by for but not *apply.

挽袖吟 2024-08-28 23:26:53

Sometimes the speedup can be substantial, like when you have to nest for loops to get the average based on a grouping of more than one factor. Here are two approaches that give you exactly the same result:

set.seed(1)  # for reproducibility of the results

# The data
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=T))
Z <- as.factor(sample(letters[1:10],100000,replace=T))

# the function forloop that averages X over every combination of Y and Z
forloop <- function(x,y,z){
# These ones are for optimization, so the functions 
#levels() and length() don't have to be called more than once.
  ylev <- levels(y)
  zlev <- levels(z)
  n <- length(ylev)
  p <- length(zlev)

  out <- matrix(NA,ncol=p,nrow=n)
  for(i in 1:n){
      for(j in 1:p){
          out[i,j] <- (mean(x[y==ylev[i] & z==zlev[j]]))
      }
  }
  rownames(out) <- ylev
  colnames(out) <- zlev
  return(out)
}

# Used on the generated data
forloop(X,Y,Z)

# The same using tapply
tapply(X,list(Y,Z),mean)

Both give exactly the same result, being a 5 x 10 matrix with the averages and named rows and columns. But:

> system.time(forloop(X,Y,Z))
   user  system elapsed 
   0.94    0.02    0.95 

> system.time(tapply(X,list(Y,Z),mean))
   user  system elapsed 
   0.06    0.00    0.06 

There you go. What did I win? ;-)

陌伤ぢ 2024-08-28 23:26:53

...and as I just wrote elsewhere, vapply is your friend!
...it's like sapply, but you also specify the return value type, which makes it much faster.

foo <- function(x) x+1
y <- numeric(1e6)

system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
#   user  system elapsed 
#   3.54    0.00    3.53 
system.time(z <- lapply(y, foo))
#   user  system elapsed 
#   2.89    0.00    2.91 
system.time(z <- vapply(y, foo, numeric(1)))
#   user  system elapsed 
#   1.35    0.00    1.36 

Jan. 1, 2020 update:

system.time({z1 <- numeric(1e6); for(i in seq_along(y)) z1[i] <- foo(y[i])})
#   user  system elapsed 
#   0.52    0.00    0.53 
system.time(z <- lapply(y, foo))
#   user  system elapsed 
#   0.72    0.00    0.72 
system.time(z3 <- vapply(y, foo, numeric(1)))
#   user  system elapsed 
#    0.7     0.0     0.7 
identical(z1, z3)
# [1] TRUE
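A related point worth adding: the declared template also makes vapply stricter than sapply. When an element returns the wrong type, sapply silently simplifies while vapply fails loudly (a small sketch; `f` is just an illustrative function):

```r
f <- function(i) if (i == 2) "oops" else i
# sapply quietly coerces the whole result to character:
sapply(1:3, f)
# [1] "1"    "oops" "3"
# vapply refuses, because a numeric(1) result was promised:
try(vapply(1:3, f, numeric(1)))
# throws a type error instead of coercing
```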
何必那么矫情 2024-08-28 23:26:53

I've written elsewhere that an example like Shane's doesn't really stress the difference in performance among the various kinds of looping syntax, because the time is all spent within the function rather than actually stressing the loop. Furthermore, the code unfairly compares a for loop that doesn't store its result with apply family functions that return a value. Here's a slightly different example that emphasizes the point.

foo <- function(x) {
   x <- x+1
 }
y <- numeric(1e6)
system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
#   user  system elapsed 
#  4.967   0.049   7.293 
system.time(z <- sapply(y, foo))
#   user  system elapsed 
#  5.256   0.134   7.965 
system.time(z <- lapply(y, foo))
#   user  system elapsed 
#  2.179   0.126   3.301 

If you plan to save the result then apply family functions can be much more than syntactic sugar.

(The simple unlist of z takes only 0.2 s, so lapply is still much faster. Initializing z inside the for loop is quite fast; since I'm giving the average of the last 5 of 6 runs, moving it outside system.time would hardly affect things.)

One more thing to note is that there is another reason to use apply family functions, independent of their performance, clarity, or lack of side effects. A for loop typically encourages putting as much as possible within the loop, because each loop requires setting up variables to store information (among other possible operations). Apply statements tend to be biased the other way. Often you want to perform multiple operations on your data, several of which can be vectorized but some of which might not be. In R, unlike other languages, it is best to separate those operations out: run the ones that are not vectorized in an apply statement (or a vectorized version of the function), and the ones that are vectorized as true vector operations. This often speeds up performance tremendously.
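A small sketch of that principle (the variable names are arbitrary): the per-group reduction, which isn't vectorizable, stays in the apply call, while the element-wise scaling is pulled out as a true vector operation:

```r
x <- rnorm(1e5)
g <- sample(letters[1:5], 1e5, replace = TRUE)

# mixed: scaling AND reducing both happen inside the loop body
mixed <- sapply(split(x, g), function(v) mean(v * 100))

# separated: scale once as a vector operation, loop only for the reduction
separated <- sapply(split(x * 100, g), mean)

identical(mixed, separated)
# [1] TRUE
```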

Taking Joris Meys' example, where he replaces a traditional for loop with a handy R function, we can use it to show the efficiency of writing code in a more R-friendly manner, achieving a similar speedup without the specialized function.

set.seed(1)  # for reproducibility of the results

# The data - copied from Joris Meys answer
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=T))
Z <- as.factor(sample(letters[1:10],100000,replace=T))

# an R way to generate tapply functionality that is fast and 
# shows more general principles about fast R coding
YZ <- interaction(Y, Z)
XS <- split(X, YZ)
m <- vapply(XS, mean, numeric(1))
m <- matrix(m, nrow = length(levels(Y)))
rownames(m) <- levels(Y)
colnames(m) <- levels(Z)
m

This winds up being much faster than the for loop and just a little slower than the built-in, optimized tapply function. It's not because vapply is so much faster than for, but because it only performs one operation in each iteration of the loop. In this code everything else is vectorized. In Joris Meys' traditional for loop, many (7?) operations occur in each iteration, and there's quite a bit of setup just for it to execute. Note also how much more compact this is than the for version.

乱世争霸 2024-08-28 23:26:53

When applying functions over subsets of a vector, tapply can be considerably faster than a for loop. Example:

df <- data.frame(id = rep(letters[1:10], 100000),
                 value = rnorm(1000000))

f1 <- function(x)
  tapply(x$value, x$id, sum)

f2 <- function(x){
  res <- 0
  for(i in seq_along(l <- unique(x$id)))
    res[i] <- sum(x$value[x$id == l[i]])
  names(res) <- l
  res
}            

library(microbenchmark)

> microbenchmark(f1(df), f2(df), times=100)
Unit: milliseconds
   expr      min       lq   median       uq      max neval
 f1(df) 28.02612 28.28589 28.46822 29.20458 32.54656   100
 f2(df) 38.02241 41.42277 41.80008 42.05954 45.94273   100

apply, however, in most situations doesn't provide any speed increase, and in some cases can even be a lot slower:

mat <- matrix(rnorm(1000000), nrow=1000)

f3 <- function(x)
  apply(x, 2, sum)

f4 <- function(x){
  res <- 0
  for(i in 1:ncol(x))
    res[i] <- sum(x[,i])
  res
}

> microbenchmark(f3(mat), f4(mat), times=100)
Unit: milliseconds
    expr      min       lq   median       uq      max neval
 f3(mat) 14.87594 15.44183 15.87897 17.93040 19.14975   100
 f4(mat) 12.01614 12.19718 12.40003 15.00919 40.59100   100

But for these situations we've got colSums and rowSums:

f5 <- function(x)
  colSums(x) 

> microbenchmark(f5(mat), times=100)
Unit: milliseconds
    expr      min       lq   median       uq      max neval
 f5(mat) 1.362388 1.405203 1.413702 1.434388 1.992909   100
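The row-wise counterpart mentioned above behaves the same way; rowSums matches the apply version up to floating-point tolerance:

```r
mat <- matrix(rnorm(1000000), nrow = 1000)
r1 <- apply(mat, 1, sum)   # loop over rows
r2 <- rowSums(mat)         # optimized built-in
isTRUE(all.equal(r1, r2))
# [1] TRUE
```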