Grouping functions (tapply, by, aggregate) and the *apply family
Whenever I want to do something "map"py in R, I usually try to use a function in the `apply` family.
However, I've never quite understood the differences between them: how {`sapply`, `lapply`, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be. So I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
`sapply(vec, f)`: input is a vector. Output is a vector/matrix, where element `i` is `f(vec[i])`, giving you a matrix if `f` has a multi-element output.
`lapply(vec, f)`: same as `sapply`, but output is a list?
`apply(matrix, 1/2, f)`: input is a matrix. Output is a vector, where element `i` is f(row/col i of the matrix).
`tapply(vector, grouping, f)`: output is a matrix/array, where an element in the matrix/array is the value of `f` at a grouping `g` of the vector, and `g` gets pushed to the row/col names.
`by(dataframe, grouping, f)`: let `g` be a grouping. Apply `f` to each column of the group/dataframe. Pretty print the grouping and the value of `f` at each column.
`aggregate(matrix, grouping, f)`: similar to `by`, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape. Would `plyr` or `reshape` replace all of these entirely?
R has many *apply functions which are ably described in the help files (e.g. `?apply`). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular `plyr` package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick `colMeans`, `rowMeans`, `colSums`, `rowSums`.
lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find `lapply` underneath.
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing `unlist(lapply(...))`, stop and consider `sapply`.
In more advanced uses of `sapply` it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, `sapply` will use them as columns of a matrix. If our function returns a 2 dimensional matrix, `sapply` will do essentially the same thing, treating each returned matrix as a single long vector, unless we specify `simplify = "array"`, in which case it will use the individual matrices to build a multi-dimensional array.
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
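The `apply`, `lapply` and `sapply` behaviours described above can be sketched roughly as follows. The matrix `M`, the list `x` and the result names are made-up illustration data, not part of the original answer.

```r
# Illustrative data (assumed for this sketch)
M <- matrix(1:16, nrow = 4, ncol = 4)
x <- list(a = 1, b = 1:3, c = 10:100)

# apply: one result per row (MARGIN = 1) or per column (MARGIN = 2)
row_min <- apply(M, 1, min)
col_max <- apply(M, 2, max)

# lapply: list in, list out
len_list <- lapply(x, length)

# sapply: same computation, simplified to a named vector
len_vec <- sapply(x, length)

# Equal-length vector results become the columns of a matrix
powers <- sapply(1:5, function(i) c(i, i^2, i^3))

# Matrix results are flattened into columns by default ...
flat <- sapply(1:5, function(i) matrix(i, 2, 2))

# ... unless simplify = "array" is requested
arr <- sapply(1:5, function(i) matrix(i, 2, 2), simplify = "array")

dim(powers)  # 3 5
dim(flat)    # 4 5
dim(arr)     # 2 2 5
```

For plain row or column sums and means of `M`, `rowSums(M)` and `colMeans(M)` would be the faster choice, as noted above.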
vapply - When you want to use `sapply` but perhaps need to squeeze some more speed out of your code or want more type safety.
For `vapply`, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
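A small sketch of the "example of what your function will return" idea: the `FUN.VALUE` argument is the template. The list `x` and the names `lens` and `mismatch` are assumptions made for illustration.

```r
# Illustrative data (assumed for this sketch)
x <- list(a = 1, b = 1:3, c = 10:100)

# FUN.VALUE = integer(1) declares: every result is a single integer
lens <- vapply(x, length, FUN.VALUE = integer(1))

# A result that does not match the template is an error,
# not a silent coercion as it could be with sapply
mismatch <- tryCatch(
  vapply(x, length, FUN.VALUE = character(1)),
  error = function(e) "error"
)
```

The type check is what gives the extra safety: a function that unexpectedly returns the wrong type or length fails immediately.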
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in `sapply`.
This is multivariate in the sense that your function must accept multiple arguments.
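As a rough illustration of the element-by-element pairing, with input vectors chosen only for this sketch:

```r
# Element-wise over three vectors: sum(1, 1, 1), sum(2, 2, 2), ...
sums <- mapply(sum, 1:5, 1:5, 1:5)

# Results of unequal length cannot be simplified, so a list comes back:
# rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
reps <- mapply(rep, 1:4, 4:1)

sums
# [1]  3  6  9 12 15
```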
Map - A wrapper to `mapply` with `SIMPLIFY = FALSE`, so it is guaranteed to return a list.
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon `rapply` is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. `rapply` is best illustrated with a user-defined function to apply.
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple. Take a vector `x` and a factor `y` (of the same length!) defining groups, and add up the values in `x` within each subgroup defined by `y`.
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors.
`tapply` is similar in spirit to the split-apply-combine functions that are common in R (`aggregate`, `by`, `ave`, `ddply`, etc.). Hence its black sheep status.
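The `Map`, `rapply` and `tapply` descriptions above can be sketched as follows. The function `bang_or_increment`, the list `nested` and the result names are hypothetical names introduced for this illustration; `x` and `y` follow the naming used in the text.

```r
# Map: like mapply, but always returns a list
sum_list <- Map(sum, 1:5, 1:5, 1:5)

# rapply: apply a user-defined function to every leaf of a nested list
# (bang_or_increment and nested are hypothetical illustration names)
bang_or_increment <- function(v) {
  if (is.character(v)) paste0(v, "!") else v + 1
}
nested <- list(
  a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
  b = 3,
  c = "Yikes",
  d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5))
)
# how = "replace" keeps the nested structure and replaces each leaf
replaced <- rapply(nested, bang_or_increment, how = "replace")

# tapply: sum x within each group defined by the factor y
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
group_sums <- tapply(x, y, sum)
group_sums
#  a  b  c  d  e
# 10 26 42 58 74
```

Passing a list of several factors as the second argument of `tapply` would give one cell per combination of their levels.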
As a side note, here is how the various `plyr` functions correspond to the base `*apply` functions (from the intro to plyr document from the plyr webpage http://had.co.nz/plyr/).
One of the goals of `plyr` is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from `dlply()` is easily passable to `ldply()` to produce useful output, etc.
Conceptually, learning `plyr` is no more difficult than understanding the base `*apply` functions.
`plyr` and `reshape` functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:
From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:
(Hopefully it's clear that `apply` corresponds to @Hadley's `aaply` and `aggregate` corresponds to @Hadley's `ddply` etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)
(on the left is input, on the top is output)
First start with Joran's excellent answer; it is doubtful anything can better that.
Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so; for these you'll find justification in Joran's discussions.
Mnemonics
`lapply` is a list apply which acts on a list or vector and returns a list.
`sapply` is a simple `lapply` (function defaults to returning a vector or matrix when possible)
`vapply` is a verified apply (allows the return object type to be prespecified)
`rapply` is a recursive apply for nested lists, i.e. lists within lists
`tapply` is a tagged apply where the tags identify the subsets
`apply` is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
Building the Right Background
If using the `apply` family still feels a bit alien to you, then it might be that you're missing a key point of view.
These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the `apply` family of functions.
Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R, and `apply` will make a lot more sense.
I realized that the (very excellent) answers to this post lack explanations of `by` and `aggregate`. Here is my contribution.
BY
The `by` function, as stated in the documentation, can be thought of as a "wrapper" for `tapply`. The power of `by` arises when we want to compute a task that `tapply` can't handle. As a starting point, take two objects, `ct` (built with `tapply`) and `cb` (built with `by`), that apply the same function to the same groups. If we print these two objects, we "essentially" have the same results, and the only differences are in how they are shown and in their `class` attributes: `by` for `cb` and `array` for `ct`.
As I've said, the power of `by` arises when we can't use `tapply`. Say we want to calculate the `summary` of all variables in `iris` along the factor `Species`: with `tapply`, R says that arguments must have the same lengths, because it does not know how to handle it.
With the `by` function, R dispatches a specific method for the `data frame` class and then lets the `summary` function work even though the length (and the type) of the first argument is different. It works indeed, and the result is very surprising: it is an object of class `by` that, for each level of `Species`, computes the `summary` of each variable.
Note that if the first argument is a `data frame`, the applied function must have a method for that class of objects. For example, if we use the same call with the `mean` function, the result makes no sense at all.
AGGREGATE
`aggregate` can be seen as another way of using `tapply`, if we use it in such a way.
The two immediate differences are that the second argument of `aggregate` must be a list while for `tapply` it can (not mandatory) be a list, and that the output of `aggregate` is a data frame while that of `tapply` is an `array`.
The power of `aggregate` is that it can easily handle subsets of the data with the `subset` argument and that it has methods for `ts` objects and `formula` as well.
These elements make `aggregate` easier to work with than `tapply` in some situations. Some examples are available in the documentation.
We can achieve the same with `tapply`, but the syntax is slightly harder and the output (in some circumstances) less readable.
There are other times when we can't use `by` or `tapply` and we have to use `aggregate`.
We cannot obtain the previous result with `tapply` in one call; we have to calculate the mean along `Month` for each element and then combine them (also note that we have to pass `na.rm = TRUE`, because the `formula` method of the `aggregate` function has `na.action = na.omit` by default).
With `by` we just can't achieve that; in fact, the corresponding function call returns an error (but most likely it is related to the supplied function, `mean`).
Other times the results are the same and the differences are just in the class of the object (and therefore in how it is shown/printed and, for example, how to subset it).
The previous code achieves the same goal and results; at some point, which tool to use is just a matter of personal taste and needs. The previous two objects have very different needs in terms of subsetting.
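A hedged sketch of the `by` and `aggregate` comparisons discussed in this answer, using the built-in `iris` and `airquality` datasets. The names `ct` and `cb` come from the text; the exact calls, and the names `by_summary`, `at`, `ag` and `ag_air`, are assumptions made for illustration, since the answer's own code is not preserved here.

```r
# ct and cb as named in the text; using summary() per species is an assumption
ct <- tapply(iris$Sepal.Width, iris$Species, summary)
cb <- by(iris$Sepal.Width, iris$Species, summary)
class(ct)  # "array"
class(cb)  # "by"

# by() has a data frame method: summary of every column, per species
by_summary <- by(iris, iris$Species, summary)

# aggregate() returns a data frame; tapply() returns an array
at <- tapply(iris$Sepal.Length, iris$Species, mean)
ag <- aggregate(iris$Sepal.Length, list(Species = iris$Species), mean)

# Formula method: several variables at once; with the default
# na.action = na.omit, rows containing NA are dropped before averaging
ag_air <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, FUN = mean)
```

`ag` and `at` hold the same per-species means, but `ag` is subset like a data frame (`ag$x`, `ag[ag$Species == "setosa", ]`) while `at` is subset like a named array (`at["setosa"]`).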
There are lots of great answers which discuss differences in the use cases for each function. None of the answers discuss the differences in performance. That is reasonable, because the various functions expect different inputs and produce different outputs, yet most of them share the common objective of evaluating by series/groups. My answer is going to focus on performance. Because of the above, the creation of the input from the vectors is included in the timing, and the `apply` function is not measured.
I have tested two different functions, `sum` and `length`, at once. The volume tested is 50M on input and 50K on output. I have also included two currently popular packages which were not widely used at the time the question was asked, `data.table` and `dplyr`. Both are definitely worth a look if you are aiming for good performance.
Despite all the great answers here, there are 2 more base functions that deserve to be mentioned: the useful `outer` function and the obscure `eapply` function.
outer
`outer` is a very useful function hidden as a more mundane one. If you read the help for `outer`, its description makes it seem like it is only useful for linear algebra type things. However, it can be used much like `mapply` to apply a function to two vectors of inputs. The difference is that `mapply` will apply the function to the first two elements and then the second two etc., whereas `outer` will apply the function to every combination of one element from the first vector and one from the second.
I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
eapply
`eapply` is like `lapply` except that rather than applying a function to every element in a list, it applies a function to every element in an environment. One use is finding the user defined functions in the global environment.
Frankly I don't use this very much, but if you are building a lot of packages or create a lot of environments it may come in handy.
It is maybe worth mentioning `ave`. `ave` is `tapply`'s friendly cousin. It returns results in a form that you can plug straight back into your data frame.
There is nothing in the base package that works like `ave` for whole data frames (as `by` is like `tapply` for data frames). But you can fudge it.
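A sketch of both points, assuming a small made-up data frame `dfr` (the column names `a`, `f`, `z`, `m` and `foo` are illustrative). The "fudge" shown here is one possible approach: pass row indices to `ave` and look the rows up inside the function.

```r
# Illustrative data frame (assumed for this sketch)
dfr <- data.frame(a = 1:20, f = rep(LETTERS[1:5], each = 4), z = 20:1)

# ave returns one value per row, so the group mean plugs straight back in.
# Note: FUN must be passed by name, since unnamed arguments are groupings.
dfr$m <- ave(dfr$a, dfr$f, FUN = mean)

# Fudge for whole data frames: group the row indices, then use them
# to pull the full rows of each group inside the function
dfr$foo <- ave(seq_len(nrow(dfr)), dfr$f, FUN = function(i) {
  rows <- dfr[i, ]
  sum(rows$a * rows$z)
})

head(dfr$m, 4)
# [1] 2.5 2.5 2.5 2.5
```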
I recently discovered the rather useful `sweep` function and add it here for the sake of completeness:
sweep
The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):
Let's say you have a matrix `dataPoints` and want to standardize it column-wise: one `sweep` call subtracts each column's mean, and a second divides by each column's standard deviation.
NB: for this simple example the same result can of course be achieved more easily by `apply(dataPoints, 2, scale)`.
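The column-wise standardization can be sketched like this. The contents of `dataPoints` and the intermediate names are assumptions for illustration; only the name `dataPoints` appears in the text.

```r
# Illustrative matrix (assumed for this sketch)
dataPoints <- matrix(4:15, nrow = 4)

# Per-column statistics
col_means <- apply(dataPoints, 2, mean)
col_sds <- apply(dataPoints, 2, sd)

# Sweep out the column means, then divide by the column standard deviations
centered <- sweep(dataPoints, 2, col_means, "-")
standardized <- sweep(centered, 2, col_sds, "/")

# Same numbers as the shortcut mentioned above
all.equal(standardized, apply(dataPoints, 2, scale))
```

The second argument of `sweep` is the margin, as in `apply`: `2` matches the statistics to columns, `1` would match them to rows.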
In the collapse package recently released on CRAN, I have attempted to compress most of the common apply functionality into just 2 functions:
`dapply` (Data-Apply) applies functions to rows or (default) columns of matrices and data.frames and (default) returns an object of the same type and with the same attributes (unless the result of each computation is atomic and `drop = TRUE`). The performance is comparable to `lapply` for data.frame columns, and about 2x faster than `apply` for matrix rows or columns. Parallelism is available via `mclapply` (not on Windows, since `mclapply` relies on forking).
`BY` is an S3 generic for split-apply-combine computing with vector, matrix and data.frame methods. It is significantly faster than `tapply`, `by` and `aggregate` (and also faster than `plyr`; on large data `dplyr` is faster though). Lists of grouping variables can also be supplied to `g`.
Talking about performance: a main goal of collapse is to foster high-performance programming in R and to move beyond split-apply-combine altogether. For this purpose the package has a full set of C++ based fast generic functions: `fmean`, `fmedian`, `fmode`, `fsum`, `fprod`, `fsd`, `fvar`, `fmin`, `fmax`, `ffirst`, `flast`, `fNobs`, `fNdistinct`, `fscale`, `fbetween`, `fwithin`, `fHDbetween`, `fHDwithin`, `flag`, `fdiff` and `fgrowth`. They perform grouped computations in a single pass through the data (i.e. no splitting and recombining).
In the package vignettes I provide benchmarks. Programming with the fast functions is significantly faster than programming with dplyr or data.table, especially on smaller data, but also on large data.
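A minimal sketch of the three kinds of calls described in this answer, assuming the collapse package is installed (the code is skipped otherwise). The `mtcars` dataset and the result names are chosen only for illustration, and argument defaults may differ between collapse versions.

```r
# Assumption: collapse is installed; otherwise this block does nothing
if (requireNamespace("collapse", quietly = TRUE)) {
  library(collapse)

  # dapply: apply a function to each column of a data frame
  col_medians <- dapply(mtcars, median)

  # BY: split-apply-combine on a vector with a grouping vector
  mpg_by_cyl <- BY(mtcars$mpg, mtcars$cyl, mean)

  # fmean: grouped mean computed in a single pass, no splitting
  mpg_fast <- fmean(mtcars$mpg, g = mtcars$cyl)
}
```

Both `mpg_by_cyl` and `mpg_fast` should hold the same per-group means as `tapply(mtcars$mpg, mtcars$cyl, mean)`.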
Starting in R 4.3.0, `tapply` will support data frames, and both `tapply` and `by` will support grouping data frame rows with a formula.
There are also some alternatives from other packages which are not discussed above.
The `parApply()` function in the `parallel` package provides an alternative to the apply family of functions for executing parallel computations on a cluster. Other alternatives for parallel computation in R include the `foreach` package and the `doParallel` package, which allow for parallel execution of loops and functions. The `future` package provides a simple and consistent API for using futures, which are a way to evaluate expressions asynchronously, either in parallel or sequentially. Additionally, the `purrr` package provides a functional programming approach to iteration and mapping, and supports parallelization through the `future` package.
EDIT 2023-07-02 (by future author): Replaced the deprecated and no-longer existing `multiprocess` future backend with `multisession`.
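As a rough sketch of the `parallel` package route described in this answer, using only functions shipped with R. The cluster size of 2 and the toy inputs are assumptions for illustration; real workloads need to be much heavier before the overhead of starting workers pays off.

```r
library(parallel)

# Start a small local cluster (2 workers is an arbitrary choice)
cl <- makeCluster(2)

# parSapply: parallel counterpart of sapply
squares <- parSapply(cl, 1:4, function(v) v^2)

# parApply: parallel counterpart of apply, here over the rows of a matrix
m <- matrix(1:6, nrow = 2)
row_totals <- parApply(cl, m, 1, sum)

# Always release the workers when done
stopCluster(cl)

squares
# [1]  1  4  9 16
```

The `foreach`, `future` and `purrr` packages mentioned above offer similar patterns with their own APIs; they are not shown here because they are not part of base R.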