Grouping functions (tapply, by, aggregate) and the *apply family

Posted on 2024-09-14 22:10:23

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.

Can someone explain how to use which one when?

My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. Output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output.

  2. lapply(vec, f): same as sapply, but output is a list?

  3. apply(matrix, 1/2, f): input is a matrix. Output is a vector, where element i is f(row/col i of the matrix).

  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names.

  5. by(dataframe, grouping, f): let g be a grouping. Apply f to each column of the group/dataframe. Pretty print the grouping and the value of f at each column.

  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?

北陌 2024-09-21 22:10:23

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns
    of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

     # Two dimensional matrix
     M <- matrix(seq(1,16), 4, 4)
    
     # apply min to rows
     apply(M, 1, min)
     [1] 1 2 3 4
    
     # apply max to columns
     apply(M, 2, max)
     [1]  4  8 12 16
    
     # 3 dimensional array
     M <- array( seq(32), dim = c(4,4,2))
    
     # Apply sum across each M[*, , ] - i.e. sum over the 2nd and 3rd dimensions
     apply(M, 1, sum)
     # Result is one-dimensional
     [1] 120 128 136 144
    
     # Apply sum across each M[*, *, ] - i.e. sum over the 3rd dimension
     apply(M, c(1,2), sum)
     # Result is two-dimensional
          [,1] [,2] [,3] [,4]
     [1,]   18   26   34   42
     [2,]   20   28   36   44
     [3,]   22   30   38   46
     [4,]   24   32   40   48
    

    If you want row/column means or sums for a 2D matrix, be sure to
    investigate the highly optimized, lightning-quick colMeans,
    rowMeans, colSums, rowSums.
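
    As a rough illustration of that last point, the sketch below checks that the
    dedicated helpers return the same numbers as the equivalent apply calls. The
    variable names are made up for this example, and nothing here measures speed;
    it only shows that the results agree.

    ```r
    # Same 4 x 4 matrix as in the example above
    M <- matrix(seq(1, 16), 4, 4)

    # apply() calls the function once per row / column
    row_tot_apply <- apply(M, 1, sum)
    col_avg_apply <- apply(M, 2, mean)

    # The dedicated helpers compute the same quantities directly
    row_tot_fast <- rowSums(M)
    col_avg_fast <- colMeans(M)

    # as.numeric() removes the integer vs double difference before comparing
    all.equal(as.numeric(row_tot_apply), row_tot_fast)
    all.equal(col_avg_apply, col_avg_fast)
    ```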

  • lapply - When you want to apply a function to each element of a
    list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel
    back their code and you will often find lapply underneath.

     x <- list(a = 1, b = 1:3, c = 10:100) 
     lapply(x, FUN = length) 
     $a 
     [1] 1
     $b 
     [1] 3
     $c 
     [1] 91
     lapply(x, FUN = sum) 
     $a 
     [1] 1
     $b 
     [1] 6
     $c 
     [1] 5005
    
  • sapply - When you want to apply a function to each element of a
    list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider
    sapply.

     x <- list(a = 1, b = 1:3, c = 10:100)
     # Compare with above; a named vector, not a list 
     sapply(x, FUN = length)  
     a  b  c   
     1  3 91
    
     sapply(x, FUN = sum)   
     a    b    c    
     1    6 5005 
    

    In more advanced uses of sapply it will attempt to coerce the
    result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

     sapply(1:5,function(x) rnorm(3,x))
    

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

     sapply(1:5,function(x) matrix(x,2,2))
    

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

     sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
    

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
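
    To make the shapes concrete, here is a small sketch that records the
    dimensions of each result. The helper three_vals is hypothetical and stands
    in for rnorm so that the example is deterministic. The last call shows what
    happens when the lengths differ: sapply falls back to returning a list.

    ```r
    # Hypothetical helper: always returns a numeric vector of length 3
    three_vals <- function(x) c(x, x^2, x^3)

    m    <- sapply(1:5, three_vals)                         # columns of a matrix
    flat <- sapply(1:5, function(x) matrix(x, 2, 2))        # each matrix flattened
    arr  <- sapply(1:5, function(x) matrix(x, 2, 2),
                   simplify = "array")                      # matrices stacked

    dim(m)     # 3 rows, 5 columns
    dim(flat)  # 4 rows, 5 columns
    dim(arr)   # 2 x 2 x 5 array

    # Results of different lengths cannot be simplified, so a list comes back
    ragged <- sapply(1:3, function(x) seq_len(x))
    is.list(ragged)
    ```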

  • vapply - When you want to use sapply but perhaps need to
    squeeze some more speed out of your code or want more type safety.

    For vapply, you basically give R an example of what sort of thing
    your function will return, which can save some time coercing returned
    values to fit in a single atomic vector.

     x <- list(a = 1, b = 1:3, c = 10:100)
     # Note that since the advantage here is mainly speed, this
     # example is only for illustration. We're telling R that
     # everything returned by length() should be an integer of
     # length 1.
     vapply(x, FUN = length, FUN.VALUE = 0L) 
     a  b  c  
     1  3 91
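
    The type-safety side can be sketched as follows. This is only an
    illustration with made-up variable names: a template that matches is
    accepted, a template that does not match raises an error, and on empty
    input vapply still returns the declared type where sapply returns a list.

    ```r
    x <- list(a = 1, b = 1:3, c = 10:100)

    # Declare the template: every call must return exactly two numbers
    rng <- vapply(x, FUN = range, FUN.VALUE = numeric(2))
    dim(rng)   # one column per list element

    # A template that does not match is an error rather than an odd result
    bad <- tryCatch(
      vapply(x, FUN = range, FUN.VALUE = numeric(1)),
      error = function(e) "rejected"
    )

    # On empty input, vapply keeps the declared type; sapply returns a list
    empty_v <- vapply(list(), length, FUN.VALUE = integer(1))
    empty_s <- sapply(list(), length)
    ```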
    
  • mapply - For when you have several data structures (e.g.
    vectors, lists) and you want to apply a function to the 1st elements
    of each, and then the 2nd elements of each, etc., coercing the result
    to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept
    multiple arguments.

     #Sums the 1st elements, the 2nd elements, etc. 
     mapply(sum, 1:5, 1:5, 1:5) 
     [1]  3  6  9 12 15
     #To do rep(1,4), rep(2,3), etc.
     mapply(rep, 1:4, 4:1)   
     [[1]]
     [1] 1 1 1 1
    
     [[2]]
     [1] 2 2 2
    
     [[3]]
     [1] 3 3
    
     [[4]]
     [1] 4
    
  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

     Map(sum, 1:5, 1:5, 1:5)
     [[1]]
     [1] 3
    
     [[2]]
     [1] 6
    
     [[3]]
     [1] 9
    
     [[4]]
     [1] 12
    
     [[5]]
     [1] 15
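
    A quick way to convince yourself of the relationship between the two is the
    sketch below (variable names are invented for the example): calling mapply
    with SIMPLIFY = FALSE gives the same list as Map, and unlisting that list
    gives back the simplified mapply result.

    ```r
    as_list <- mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)
    via_map <- Map(sum, 1:5, 1:5, 1:5)

    # Same list either way
    identical(as_list, via_map)

    # Flattening the list recovers what mapply returns by default
    identical(unlist(via_map), mapply(sum, 1:5, 1:5, 1:5))
    ```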
    
  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

     # Append ! to string, otherwise increment
     myFun <- function(x){
         if(is.character(x)){
           return(paste(x,"!",sep=""))
         }
         else{
           return(x + 1)
         }
     }
    
     #A nested list structure
     l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
               b = 3, c = "Yikes", 
               d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
    
    
     # Result is named vector, coerced to character          
     rapply(l, myFun)
    
     # Result is a nested list like l, with values altered
     rapply(l, myFun, how="replace")
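
    An alternative to branching on the type inside the function is the classes
    argument of rapply, which restricts the function to leaves of a given class.
    The sketch below is one possible variation on the example above, not the
    only way to do it; with how = "replace" the other leaves are kept unchanged.

    ```r
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
              b = 3, c = "Yikes",
              d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

    # Only numeric leaves are passed to the function; strings are left alone
    bumped <- rapply(l, function(x) x + 1, classes = "numeric", how = "replace")

    bumped$a$b1      # numeric leaf, incremented
    bumped$a$a1      # character leaf, untouched
    bumped$d$b2$b3   # works at any nesting depth
    ```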
    
  • tapply - For when you want to apply a function to subsets of a
    vector and the subsets are defined by some other vector, usually a
    factor.

    The black sheep of the *apply family, of sorts. The help file's use of
    the phrase "ragged array" can be a bit confusing, but it is actually
    quite simple.

    A vector:

     x <- 1:20
    

    A factor (of the same length!) defining groups:

     y <- factor(rep(letters[1:5], each = 4))
    

    Add up the values in x within each subgroup defined by y:

     tapply(x, y, sum)  
      a  b  c  d  e  
     10 26 42 58 74 
    

    More complex examples can be handled where the subgroups are defined
    by the unique combinations of a list of several factors. tapply is
    similar in spirit to the split-apply-combine functions that are
    common in R (aggregate, by, ave, ddply, etc.), hence its
    black sheep status.
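
A minimal sketch of the "several factors" case for tapply, reusing x and y from above. The second factor z is hypothetical and simply marks odd and even positions; passing both factors as a list yields one cell per combination.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
# Hypothetical second grouping factor: odd / even position in x
z <- factor(rep(c("odd", "even"), times = 10))

# One row per level of y, one column per level of z
res <- tapply(x, list(y, z), sum)

dim(res)
res["a", "odd"]    # 1 + 3
res["a", "even"]   # 2 + 4

# Summing over z recovers the one-factor result shown earlier
rowSums(res)
```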

长不大的小祸害 2024-09-21 22:10:23

As a side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document on the plyr webpage http://had.co.nz/plyr/):

Base function   Input   Output   plyr function 
---------------------------------------
aggregate        d       d       ddply + colwise 
apply            a       a/l     aaply / alply 
by               d       l       dlply 
lapply           l       l       llply  
mapply           a       a/l     maply / mlply 
replicate        r       a/l     raply / rlply 
sapply           l       a       laply 

One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.

Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my everyday use. But, also from the Intro to Plyr document:

Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.

寄风 2024-09-21 22:10:23

From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:

[Image: apply, sapply, lapply, by, aggregate]

(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)

(on the left is input, on the top is output)

浮华 2024-09-21 22:10:23

First start with Joran's excellent answer -- doubtful anything can better that.

Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.

Mnemonics

  • lapply is a list apply which acts on a list or vector and returns a list.
  • sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
  • vapply is a verified apply (allows the return object type to be prespecified)
  • rapply is a recursive apply for nested lists, i.e. lists within lists
  • tapply is a tagged apply where the tags identify the subsets
  • apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
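
The mnemonics can be checked against small inputs. The sketch below uses toy data and invented variable names; it runs each function once and keeps the result so that its type and values can be inspected.

```r
x <- list(a = 1:3, b = 4:6)

r_l <- lapply(x, sum)                        # list in, list out
r_s <- sapply(x, sum)                        # simplified to a named vector
r_v <- vapply(x, sum, FUN.VALUE = integer(1))  # same, with the type verified

# Recursive: reaches the leaves of a nested list
r_r <- rapply(list(1, list(2, 3)), function(v) v * 10, how = "unlist")

# Tagged: the second argument labels which group each value belongs to
r_t <- tapply(1:6, rep(c("u", "v"), 3), sum)

# Generic: works along a dimension of a matrix or array
r_a <- apply(matrix(1:6, nrow = 2), 1, sum)

is.list(r_l)
r_s
r_t
```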

Building the Right Background

If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.

These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.

Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.

£冰雨忧蓝° 2024-09-21 22:10:23

Since I realized that the (very excellent) answers to this post lack explanations of by and aggregate, here is my contribution.

BY

The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. To start with, here is a case that both can handle:

ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )

 cb
iris$Species: setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 
-------------------------------------------------------------- 
iris$Species: versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 
-------------------------------------------------------------- 
iris$Species: virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 


ct
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 

$versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 

$virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 

If we print these two objects, ct and cb, we "essentially" get the same results; the only differences are in how they are displayed and in their class attributes: by for cb and array for ct.
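
The claim about classes can be verified directly. This is just a sketch on the built-in iris data: it inspects the class of each object and compares the numbers stored for one group.

```r
ct <- tapply(iris$Sepal.Width, iris$Species, summary)
cb <- by(iris$Sepal.Width, iris$Species, summary)

class(ct)
class(cb)

# Both can be indexed by group name and hold the same numbers
all.equal(as.numeric(ct[["setosa"]]), as.numeric(cb[["setosa"]]))
```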

As I've said, the power of by arises when we can't use tapply; the following code is one example:

 tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) : 
  arguments must have same length

R complains that the arguments must have the same length. What we are asking for is "the summary of every variable in iris for each level of the factor Species", but tapply, called this way, does not know how to handle a data frame.

With the by function, R dispatches a specific method for the data frame class, and summary then works even though the first argument differs in length (and in type) from the grouping factor.

bywork <- by(iris, iris$Species, summary )

bywork
iris$Species: setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
-------------------------------------------------------------- 
iris$Species: versicolor
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
-------------------------------------------------------------- 
iris$Species: virginica
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500     

It works indeed, and the result is quite striking: an object of class by that holds, for each level of Species, the summary of every variable.

Note that if the first argument is a data frame, the applied function must have a method for that class of objects. For example, if we use this code with the mean function, we get a result that makes no sense at all:

 by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
------------------------------------------- 
iris$Species: versicolor
[1] NA
------------------------------------------- 
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA

AGGREGATE

aggregate can be seen as another way of using tapply, if we use it like this:

at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)

 at
    setosa versicolor  virginica 
     5.006      5.936      6.588 
 ag
     Group.1     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588

The two immediate differences are that the second argument of aggregate must be a list, while for tapply it can (but need not) be a list, and that the output of aggregate is a data frame, while that of tapply is an array.
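
Both differences can be seen in a short sketch. Naming the list element (Species here is my choice of name) also replaces the default Group.1 column name, and passing a bare factor to aggregate is caught as an error.

```r
at <- tapply(iris$Sepal.Length, iris$Species, mean)
ag <- aggregate(iris$Sepal.Length, list(Species = iris$Species), mean)

is.array(at)       # tapply returns an array
is.data.frame(ag)  # aggregate returns a data frame
names(ag)          # the list name becomes the grouping column name

# A bare factor is accepted by tapply but rejected by aggregate
msg <- tryCatch(
  aggregate(iris$Sepal.Length, iris$Species, mean),
  error = function(e) conditionMessage(e)
)
msg
```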

The power of aggregate is that it can easily handle subsets of the data via the subset argument, and that it also has methods for ts objects and formulas.

These features make aggregate easier to work with than tapply in some situations.
Here are some examples (available in the documentation):

ag <- aggregate(len ~ ., data = ToothGrowth, mean)

 ag
  supp dose   len
1   OJ  0.5 13.23
2   VC  0.5  7.98
3   OJ  1.0 22.70
4   VC  1.0 16.77
5   OJ  2.0 26.06
6   VC  2.0 26.14

We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:

att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

 att
       OJ    VC
0.5 13.23  7.98
1   22.70 16.77
2   26.06 26.14

There are other times when by or tapply are awkward to use and aggregate is the natural choice:

 ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

 ag1
  Month    Ozone     Temp
1     5 23.61538 66.73077
2     6 29.44444 78.22222
3     7 59.11538 83.88462
4     8 59.96154 83.96154
5     9 31.44828 76.89655

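We cannot obtain the previous result with tapply in one call; we have to calculate the mean by Month for each variable and then combine them. Also note that we have to pass na.rm = TRUE, because the formula method of aggregate uses na.action = na.omit by default. Even so, the Temp column below differs from the aggregate result: na.omit drops every row in which Ozone is missing, whereas na.rm = TRUE only drops the missing values within each variable.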

ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

 cbind(ta1, ta2)
       ta1      ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
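
To reproduce the aggregate numbers exactly with tapply, one option is to drop the incomplete rows first, which mimics what na.omit does in the formula method. This is a sketch with made-up variable names (cc, ta_temp, ta_ozone):

```r
# Keep only rows where all variables used in the formula are present
cc <- airquality[complete.cases(airquality[c("Ozone", "Temp", "Month")]), ]

ta_ozone <- tapply(cc$Ozone, cc$Month, mean)
ta_temp  <- tapply(cc$Temp,  cc$Month, mean)

ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

# Now both columns agree with aggregate
all.equal(as.vector(ta_ozone), ag1$Ozone)
all.equal(as.vector(ta_temp),  ag1$Temp)
```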

With by, the direct approach does not work either: the following call does not give us what we want, although the problem lies in the supplied function, mean, which has no method for data frames, rather than in by itself:

by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
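
Since the obstacle is mean rather than by, one possible workaround is to supply a function that does accept a data frame, such as colMeans. The sketch below is only one way to do it; note that it removes missing values per column, so it matches the tapply numbers above rather than the aggregate ones.

```r
by_means <- by(airquality[c("Ozone", "Temp")], airquality$Month,
               colMeans, na.rm = TRUE)

# Stack the per-month vectors into a matrix: one row per month
means_mat <- do.call(rbind, by_means)

dim(means_mat)
colnames(means_mat)
```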

Other times the results are the same and the differences are just in the class of the object (and therefore in how it is printed and, for example, how it is subset):

byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

The previous two calls achieve the same goal; at some point, which tool to use is just a matter of personal taste and needs. Note, though, that the two resulting objects are subset in very different ways.
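
A short sketch of what subsetting looks like in each case (variable names first_by and oz are invented): the by object behaves like a list with one summary table per month, while aggregate stores each variable's summaries as a matrix column of the data frame.

```r
byagg  <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

# by: pick a group, get the whole summary table for that group
first_by <- byagg[[1]]
class(first_by)

# aggregate: pick a variable, get a matrix with one row per month
oz <- aggagg$Ozone
dim(oz)

# Mean Ozone for May, selected by row condition and column name
oz[aggagg$Month == 5, "Mean"]
```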

野稚 2024-09-21 22:10:23

There are lots of great answers that discuss the differences in use cases for each function, but none of them discusses the differences in performance. That is reasonable, because the various functions expect different inputs and produce different outputs, yet most of them share the general objective of evaluating a function by series/groups. My answer focuses on performance. Because of the above, creating the input from the vectors is included in the timings, and the apply function is not measured.

I tested two different functions, sum and length, at once. The volume tested is 50M elements on input and 500K groups on output (n = 5e7 and k = 5e5 in the code below). I also included two currently popular packages that were not widely used when the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.

library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)

timing = list()

# sapply
timing[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})

# lapply
timing[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})

# tapply
timing[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)

# by
timing[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# aggregate
timing[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# dplyr
timing[["dplyr"]] = system.time({
    df = tibble(x, grp)  # tibble() replaces the deprecated data_frame()
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# data.table
timing[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})

# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table), 
       function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
#    sapply     lapply     tapply         by  aggregate      dplyr data.table 
#      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 

# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
              )[,.(fun = V1, elapsed = V2)
                ][order(-elapsed)]
#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686
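
Before comparing timings at full scale, it can help to confirm on a small input that two of the approaches really compute the same thing. The following base R sketch does that for the group sums; the names x_small, grp_small, sums_split and sums_tapply are illustrative and not part of the benchmark above.

```r
set.seed(1)
x_small   <- runif(1000)
grp_small <- sample(10, 1000, TRUE)

# split + vapply: one sum per group, returned as a named numeric vector
sums_split <- vapply(split(x_small, grp_small), sum, numeric(1))

# tapply: the same sums, returned as a 1-d array named by group
sums_tapply <- tapply(x_small, grp_small, sum)

# Compare the values only, ignoring the differing container classes
same_sums <- all.equal(unname(sums_split), as.vector(sums_tapply))
same_sums
```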
烟沫凡尘 2024-09-21 22:10:23

Despite all the great answers here, there are two more base functions that deserve a mention: the useful outer function and the obscure eapply function.

outer

outer is a very useful function disguised as a more mundane one. If you read the help for outer, its description says:

The outer product of the arrays X and Y is the array A with dimension  
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =   
FUN(X[arrayindex.x], Y[arrayindex.y], ...).

This description makes it seem useful only for linear-algebra-type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply applies the function to the first elements of each vector, then to the second elements, and so on, whereas outer applies the function to every combination of one element from the first vector and one from the second. Note that outer calls the function once on the expanded vectors, so the function must be vectorized. For example:

 A<-c(1,3,5,7,9)
 B<-c(0,3,6,9,12)

mapply(FUN=pmax, A, B)

> mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12

outer(A,B, pmax)

 > outer(A,B, pmax)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    3    6    9   12
 [2,]    3    3    6    9   12
 [3,]    5    5    6    9   12
 [4,]    7    7    7    9   12
 [5,]    9    9    9    9   12

I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
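
A minimal sketch of that values-versus-conditions use, assuming the conditions are simple thresholds (the vectors values and thresholds are invented for illustration):

```r
values     <- c(2, 5, 8, 11)
thresholds <- c(low = 3, mid = 6, high = 9)

# One row per value, one column per threshold: TRUE where value > threshold
meets <- outer(values, thresholds, FUN = ">")
dimnames(meets) <- list(value = as.character(values),
                        condition = names(thresholds))
meets

# Number of conditions each value satisfies
n_met <- rowSums(meets)
n_met
```

This works because ">" is vectorized; a function that only handles scalars would first need wrapping, for example with Vectorize().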

eapply

eapply is like lapply except that, rather than applying a function to every element in a list, it applies a function to every element in an environment. For example, if you want to find the user-defined functions in the global environment:

A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}

> eapply(.GlobalEnv, is.function)
$A
[1] FALSE

$B
[1] FALSE

$C
[1] FALSE

$D
[1] TRUE 

Frankly, I don't use this very much, but if you are building a lot of packages or creating a lot of environments it may come in handy.
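
The output above still lists every object. To get just the names of the functions, the logical list can be flattened and filtered. This is a sketch that uses a throwaway environment named env in place of .GlobalEnv so that it is self-contained:

```r
# A separate environment stands in for .GlobalEnv
env <- new.env()
assign("A", c(1, 3, 5, 7, 9), envir = env)
assign("C", list(x = 1, y = 2), envir = env)
assign("D", function(x) x + 1, envir = env)

# eapply() returns a named list of TRUE/FALSE; flatten it and keep the TRUE names
is_fun    <- unlist(eapply(env, is.function))
fun_names <- sort(names(is_fun)[is_fun])
fun_names
```

The sort() call is there because eapply makes no promise about the order in which it visits the objects.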

北方的韩爷 2024-09-21 22:10:23

It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...
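
ave is not limited to means: any function that returns either a single value or one value per group member can be plugged in. A small sketch, with the column names centered and idx chosen only for illustration:

```r
dfr <- data.frame(a = 1:20, f = rep(LETTERS[1:5], each = 4))

# Subtract each group's mean from its members; one result per row
dfr$centered <- ave(dfr$a, dfr$f, FUN = function(x) x - mean(x))

# Position of each row within its group
dfr$idx <- ave(dfr$a, dfr$f, FUN = seq_along)

head(dfr, 5)
```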
渡你暖光 2024-09-21 22:10:23

I recently discovered the rather useful sweep function and add it here for the sake of completeness:

sweep

The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):

Let's say you have a matrix and want to standardize it column-wise:

dataPoints <- matrix(4:15, nrow = 4)

# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)

# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)

# Center the points 
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")

# Return the result
dataPoints_Trans1
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5

# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans2
##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

NB: for this simple example the same result can of course be achieved more easily by
apply(dataPoints, 2, scale)
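
scale can also be called on the matrix directly, without apply. As a quick sanity check (a sketch; the names manual, auto and same_result are illustrative), the two-step sweep result can be compared with it:

```r
dataPoints <- matrix(4:15, nrow = 4)

# Center, then divide by the column standard deviations, using sweep() twice
manual <- sweep(dataPoints, 2, colMeans(dataPoints), "-")
manual <- sweep(manual, 2, apply(dataPoints, 2, sd), "/")

# scale() does both steps and stores the centers and scales as attributes
auto <- scale(dataPoints)

# Compare the numbers only, ignoring the extra attributes added by scale()
same_result <- all.equal(manual, auto, check.attributes = FALSE)
same_result
```

sweep remains the more general tool, since the statistic and the operator can be anything, not just a mean with "-" or a standard deviation with "/".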

漫漫岁月 2024-09-21 22:10:23

In the collapse package recently released on CRAN, I have attempted to compress most of the common apply functionality into just 2 functions:

  1. dapply (Data-Apply) applies functions to rows or (by default) columns of matrices and data.frames and (by default) returns an object of the same type and with the same attributes (unless the result of each computation is atomic and drop = TRUE). The performance is comparable to lapply for data.frame columns, and about 2x faster than apply for matrix rows or columns. Parallelism is available via mclapply (which relies on forking, so it works on Unix-alikes such as macOS but not on Windows).

Syntax:

dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L, 
       return = c("same", "matrix", "data.frame"), drop = TRUE)

Examples:

# Apply to columns:
dapply(mtcars, log)
dapply(mtcars, sum)
dapply(mtcars, quantile)
# Apply to rows:
dapply(mtcars, sum, MARGIN = 1)
dapply(mtcars, quantile, MARGIN = 1)
# Return as matrix:
dapply(mtcars, quantile, return = "matrix")
dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
# Same for matrices ...
  2. BY is an S3 generic for split-apply-combine computing with vector, matrix and data.frame methods. It is significantly faster than tapply, by and aggregate (and also faster than plyr; on large data dplyr is faster, though).

Syntax:

BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

Examples:

# Vectors:
BY(iris$Sepal.Length, iris$Species, sum)
BY(iris$Sepal.Length, iris$Species, quantile)
BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix 
# Data.frames
BY(iris[-5], iris$Species, sum)
BY(iris[-5], iris$Species, quantile)
BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
# Same for matrices ...

Lists of grouping variables can also be supplied to g.

Talking about performance: a main goal of collapse is to foster high-performance programming in R and to move beyond split-apply-combine altogether. For this purpose the package has a full set of C++-based fast generic functions: fmean, fmedian, fmode, fsum, fprod, fsd, fvar, fmin, fmax, ffirst, flast, fNobs, fNdistinct, fscale, fbetween, fwithin, fHDbetween, fHDwithin, flag, fdiff and fgrowth. They perform grouped computations in a single pass through the data (i.e. no splitting and recombining).

Syntax:

fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)

Examples:

v <- iris$Sepal.Length
f <- iris$Species

# Vectors
fmean(v)             # mean
fmean(v, f)          # grouped mean
fsd(v, f)            # grouped standard deviation
fsd(v, f, TRA = "/") # grouped scaling
fscale(v, f)         # grouped standardizing (scaling and centering)
fwithin(v, f)        # grouped demeaning

w <- abs(rnorm(nrow(iris)))
fmean(v, w = w)      # Weighted mean
fmean(v, f, w)       # Weighted grouped mean
fsd(v, f, w)         # Weighted grouped standard-deviation
fsd(v, f, w, "/")    # Weighted grouped scaling
fscale(v, f, w)      # Weighted grouped standardizing
fwithin(v, f, w)     # Weighted grouped demeaning

# Same using data.frames...
fmean(iris[-5], f)                # grouped mean
fscale(iris[-5], f)               # grouped standardizing
fwithin(iris[-5], f)              # grouped demeaning

# Same with matrices ...

In the package vignettes I provide benchmarks. Programming with the fast functions is significantly faster than programming with dplyr or data.table, especially on smaller data, but also on large data.

迷爱 2024-09-21 22:10:23

Starting with R 4.3.0, tapply supports data frames, and both tapply and by support grouping data frame rows with a formula.

> R.version.string
[1] "R version 4.3.0 beta (2023-04-07 r84200)"
> dd <- data.frame(x = 1:10, f = gl(5L, 2L), g = gl(2L, 5L))
> dd
    x f g
1   1 1 1
2   2 1 1
3   3 2 1
4   4 2 1
5   5 3 1
6   6 3 2
7   7 4 2
8   8 4 2
9   9 5 2
10 10 5 2
> tapply(dd, ~f + g, nrow)
   g
f   1 2
  1 2 0
  2 2 0
  3 1 1
  4 0 2
  5 0 2
> by(dd, ~g, identity)
g: 1
  x f g
1 1 1 1
2 2 1 1
3 3 2 1
4 4 2 1
5 5 3 1
------------------------------------------------------------ 
g: 2
    x f g
6   6 3 2
7   7 4 2
8   8 4 2
9   9 5 2
10 10 5 2
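
On R versions before 4.3.0, where the formula interface shown above is not available, roughly the same results can be sketched with older base tools. This is an assumed equivalent for illustration, not how 4.3.0 implements it: xtabs for the cell counts, and by with an explicit grouping vector. The names counts and pieces are made up.

```r
dd <- data.frame(x = 1:10, f = gl(5L, 2L), g = gl(2L, 5L))

# Row counts per (f, g) cell; empty cells are reported as 0
counts <- xtabs(~ f + g, data = dd)
counts

# Split the rows by g, passing the grouping vector explicitly
pieces <- by(dd, dd$g, identity)
pieces
```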
酷到爆炸 2024-09-21 22:10:23

Some packages offer further alternatives that are not discussed above.

The parApply() function in the parallel package provides an alternative to the apply family of functions for executing parallel computations on a cluster. Other alternatives for parallel computation in R include the foreach package and the doParallel package, which allow for parallel execution of loops and functions. The future package provides a simple and consistent API for using futures, which are a way to evaluate expressions asynchronously, either in parallel or sequentially. Additionally, the purrr package provides a functional programming approach to iteration and mapping; its parallel counterpart, the furrr package, runs purrr-style maps on top of future.

Here are some examples:

parApply() example:

library(parallel)

# Create a matrix
m <- matrix(1:20, nrow = 5)

# Define a function to apply to each column of the matrix
my_fun <- function(x) {
  x^2
}

# Apply the function to each column of the matrix in parallel
cl <- makeCluster(2)
result <- parApply(cl = cl, X = m, MARGIN = 2, FUN = my_fun)
stopCluster(cl)  # shut the worker processes down again

# View the result
result

foreach example:

library(foreach)
library(doParallel)

# Register a parallel backend
registerDoParallel(cores = 2)

# Create a list of numbers
my_list <- list(1, 2, 3, 4, 5)

# Define a function to apply to each element of the list
my_fun <- function(x) {
  x^2
}

# Apply the function to each element of the list in parallel
result <- foreach(i = my_list) %dopar% my_fun(i)

# View the result
result

future example:

library(future)

# Plan to use a parallel backend
plan(multisession, workers = 2)

# Create a list of numbers
my_list <- list(1, 2, 3, 4, 5)

# Define a function to apply to each element of the list
my_fun <- function(x) {
  x^2
}

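# Apply the function to each element of the list in parallel using futures:
# future() starts each evaluation, value() collects its result
# (future_map() is not part of the future package; it comes from furrr)
futures <- lapply(my_list, function(x) future(my_fun(x)))
result <- lapply(futures, value)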

# View the result
result

purrr example:

library(purrr)
library(future)
library(furrr)  # future_map() lives in furrr, not in purrr or future

# Plan to use a parallel backend
plan(multisession, workers = 2)

# Create a list of numbers
my_list <- list(1, 2, 3, 4, 5)

# Define a function to apply to each element of the list
my_fun <- function(x) {
  x^2
}

# Apply the function to each element of the list in parallel using the purrr-style future_map()
result <- future_map(my_list, my_fun)

# View the result
result

EDIT 2023-07-02 (by future author): Replaced deprecated and no-longer existing multiprocess future backend with multisession.
