How to sum a variable by group
I have a data frame with two columns. The first column contains categories such as "First", "Second", "Third", and the second column holds the number of times I saw each group from "Category".
For example:
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum all the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
Use `aggregate`. Multiple grouping dimensions can be specified in the `list`, and multiple aggregated metrics of the same data type can be combined via `cbind`.
(Embedding @thelatemail's comment:) `aggregate` has a formula interface too. If you want to aggregate multiple columns, you can use the `.` notation, which works for one column too.
Alternatively, use `tapply`.
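As a minimal sketch of the base R calls named in this answer, assuming the question's data lives in a data frame called `x` (the name and the construction of `x` are assumptions made here for illustration):

```r
# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# aggregate() with an explicit grouping list; the summed column is named "x"
agg_list <- aggregate(x$Frequency, by = list(Category = x$Category), FUN = sum)

# aggregate() with the formula interface
agg_formula <- aggregate(Frequency ~ Category, data = x, FUN = sum)

# "." on the left-hand side sums every column other than Category
agg_dot <- aggregate(. ~ Category, data = x, FUN = sum)

# tapply() returns a named array instead of a data frame
tap <- tapply(x$Frequency, x$Category, FUN = sum)

print(agg_formula)
```

The formula interface is usually the most readable of these, while `tapply` is handy when a named vector is enough.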
You can also use the dplyr package for that purpose. The same approach works for multiple summary columns as well as for a single one, and the built-in dataset `mtcars` is convenient for trying out further ways to summarise data by group with dplyr functions.
As of dplyr 1.1.0, you can use the `.by` argument as an alternative to `group_by()`.
For more information, including the `%>%` operator, see the introduction to dplyr.
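A minimal sketch of the dplyr approach, assuming dplyr is installed and the question's data is stored in a data frame named `x` (an assumed name). The `mtcars` summary columns `mean_mpg` and `total_hp` are illustrative choices, not taken from the original answer:

```r
library(dplyr)

# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Group by Category, then sum Frequency within each group
by_group <- x %>%
  group_by(Category) %>%
  summarise(Frequency = sum(Frequency))

# Several summaries per group on the built-in mtcars dataset
mt <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), total_hp = sum(hp))

print(by_group)
```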
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative.
It can be compared with the same operation on a plain data.frame using the approach above, and there is also a syntax for keeping the column name.
The difference becomes more noticeable with larger datasets.
For multiple aggregations, you can combine `lapply` and `.SD`.
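The `.SD` symbol mentioned in this answer belongs to the data.table package, so the sketch below assumes data.table is the faster alternative being described. The object name `dt` is an assumption made for illustration:

```r
library(data.table)

# Example data from the question as a data.table; the name `dt` is an assumption.
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Sum Frequency within each Category; the result column is auto-named V1
sums <- dt[, sum(Frequency), by = Category]

# Keep a meaningful column name by wrapping the expression in .()
named <- dt[, .(Frequency = sum(Frequency)), by = Category]

# Apply sum to every non-grouping column through lapply and .SD
all_cols <- dt[, lapply(.SD, sum), by = Category]

print(named)
```

Groups come back in order of first appearance rather than sorted, which is worth keeping in mind when comparing against `aggregate`.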
You can also use the `by()` function.
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with `by()` since it's a base function.
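A minimal sketch of the `by()` call, assuming the question's data is in a data frame named `x` (an assumed name). The conversion to a data frame at the end is one possible way to get a tabular result, not necessarily what the answer used:

```r
# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# by() applies sum to the Frequency values of each Category
res <- by(x$Frequency, x$Category, sum)

# The result has class "by"; rebuild a plain data frame from it
out <- data.frame(
  Category  = dimnames(res)[[1]],
  Frequency = as.numeric(res)
)

print(out)
```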
Several years later, just to add another simple base R solution that isn't present here for some reason: `xtabs`.
The result can be converted if you want a `data.frame` back.
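A minimal sketch of the `xtabs` route, assuming the question's data is in a data frame named `x` (an assumed name):

```r
# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Cross-tabulate, summing Frequency within each Category
tab <- xtabs(Frequency ~ Category, data = x)

# Convert the table to a data frame; the summed column is named Freq
df <- as.data.frame(tab)

print(df)
```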
If `x` is a data frame with your data, then the following will do what you want:
While I have recently become a convert to `dplyr` for most of these types of operations, the `sqldf` package is still really nice (and IMHO more readable) for some things.
This question can also be answered with `sqldf`.
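A minimal sketch of an `sqldf` solution, assuming the sqldf package is installed and the data is in a data frame named `x` (an assumed name). The exact query in the original answer is not shown, so the SQL below is an illustration:

```r
library(sqldf)

# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# sqldf() runs SQL against data frames in the calling environment
res <- sqldf("SELECT Category, SUM(Frequency) AS Frequency
              FROM x
              GROUP BY Category
              ORDER BY Category")

print(res)
```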
Just to add a third option:
EDIT: this is a very old answer. Now I would recommend the use of `group_by` and `summarise` from `dplyr`, as in @docendo's answer.
Another solution that returns sums by groups in a matrix or a data frame and is short and fast:
I find `ave` very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must or want to stick to base R).
For example, given an input with the columns `Categ1`, `Categ2`, `Samples`, and `Freq`, we want to group by `Categ1` and `Categ2` and compute the sum of `Samples` and the mean of `Freq`.
A possible solution calls `ave` once for each aggregated column.
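A minimal sketch of the `ave` approach. The input values below are invented for illustration because the answer's original input is not shown; only the column names `Categ1`, `Categ2`, `Samples`, and `Freq` come from the answer:

```r
# Hypothetical input; the values are assumptions, the column names are from the answer.
x <- data.frame(
  Categ1  = c("A", "A", "B", "B", "A", "B", "A"),
  Categ2  = c("X", "Y", "X", "X", "X", "Y", "Y"),
  Samples = c(1, 2, 4, 3, 5, 6, 7),
  Freq    = c(10, 30, 45, 55, 80, 65, 50)
)

# ave() returns a vector as long as the input, with each row
# replaced by the aggregate of its group
x2 <- x
x2$Samples <- ave(x$Samples, x$Categ1, x$Categ2, FUN = sum)
x2$Freq    <- ave(x$Freq,    x$Categ1, x$Categ2, FUN = mean)

# Keep one row per (Categ1, Categ2) combination
x2 <- x2[!duplicated(x2[c("Categ1", "Categ2")]), ]

print(x2)
```

Note that `FUN` has to be passed by name, since unnamed arguments to `ave` are treated as grouping variables.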
Since `dplyr 1.0.0`, the `across()` function can be used. It also covers the case where you are interested in multiple variables, and the variables can be chosen with select helpers.
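A minimal sketch of `across()`, assuming dplyr 1.0.0 or later. The data frame name `df` and the extra `Count` column are assumptions added so that the multi-variable case has something to work on:

```r
library(dplyr)

# Question data plus a hypothetical second numeric column `Count`.
df <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Count     = c(1, 1, 1, 1, 1, 1, 1)
)

# One variable
one <- df %>%
  group_by(Category) %>%
  summarise(across(Frequency, sum))

# Several variables named explicitly
several <- df %>%
  group_by(Category) %>%
  summarise(across(c(Frequency, Count), sum))

# Variables picked with a select helper
helper <- df %>%
  group_by(Category) %>%
  summarise(across(starts_with("Freq"), sum))

print(several)
```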
You could use the function `group.sum` from **package *Rfast***.
***Rfast*** has many group functions and `group.sum` is one of them.
Rfast has since deprecated the group functions and replaced them with a new function called `group`. Using the argument `method`, you can choose the correct algorithm. So `group.sum` is now `group(..., method = "sum")`.
Using `cast` instead of `recast` (note that `'Frequency'` is now `'value'`) to get:
A good way to sum a variable by group is `rowsum` from base. Here only `collapse::fsum` and `Rfast::group.sum` have been faster.
Regarding speed and memory consumption, `collapse::fsum` was the best in the given example, and it could be sped up further by using a grouped data frame.
That comes close to the timings obtained when the dataset was already split into sub-datasets per group.
A benchmark of different methods shows that, for summing a single column, `collapse::fsum` was two times faster than `Rfast::group.sum` and 7 times faster than `rowsum`. They were followed by `tapply`, `data.table`, `by` and `dplyr`. `xtabs` and `aggregate` were the slowest.
When aggregating two columns, `collapse::fsum` is again the fastest, 3 times faster than `Rfast::group.sum` and 5 times faster than `rowsum`. They are followed by `data.table`, `tapply`, `by` and `dplyr`. Again, `xtabs` and `aggregate` are the slowest.
Benchmark
Some methods allow preparatory tasks that might help to speed up the aggregation.
Summing up one column.
Summing up two columns.
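The original benchmark code is not reproduced here. As a rough, base-R-only sketch of how such a comparison could be set up, the following times three of the methods on simulated data. The data size, the object name `big`, and the use of `system.time` instead of a dedicated benchmarking package are all assumptions, and the timings will vary by machine:

```r
set.seed(1)

# Simulated data; the size and contents are assumptions for illustration.
n <- 1e5
big <- data.frame(
  Category  = sample(c("First", "Second", "Third"), n, replace = TRUE),
  Frequency = sample(1:20, n, replace = TRUE)
)

# Time each method once and keep its result for a consistency check
t_rowsum    <- system.time(r1 <- rowsum(big$Frequency, big$Category))
t_tapply    <- system.time(r2 <- tapply(big$Frequency, big$Category, sum))
t_aggregate <- system.time(r3 <- aggregate(Frequency ~ Category, data = big, FUN = sum))

# Elapsed seconds per method
print(c(
  rowsum    = t_rowsum[["elapsed"]],
  tapply    = t_tapply[["elapsed"]],
  aggregate = t_aggregate[["elapsed"]]
))
```

A single `system.time` call is noisy, so repeated runs would be needed before drawing conclusions about relative speed.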
You can use the `rowsum` function to calculate the frequency.
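A minimal sketch of `rowsum`, assuming the question's data is in a data frame named `x` (an assumed name). Converting the matrix result back to a data frame is an optional extra step added here:

```r
# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# rowsum() returns a one-column matrix with the groups as row names
res <- rowsum(x$Frequency, group = x$Category)

# Optional: turn the matrix into a data frame
out <- data.frame(
  Category  = rownames(res),
  Frequency = unname(res[, 1])
)

print(out)
```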
With `dplyr 1.1.0` and above, you can use `.by` in `summarise`. This shortcut avoids calling `group_by` and returns an ungrouped data frame.
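A minimal sketch of the `.by` shortcut, assuming dplyr 1.1.0 or later is installed and the data is in a data frame named `x` (an assumed name):

```r
library(dplyr)

# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Per-operation grouping: no group_by() beforehand, no ungroup() afterwards
res <- x %>%
  summarise(Frequency = sum(Frequency), .by = Category)

print(res)
```

With `.by`, groups are returned in order of first appearance rather than sorted.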
For the sake of completeness, I will add another solution using `pivot_wider` with the argument `values_fn = sum`. If the dataset has two grouping variables, `pivot_wider` will create a nice summary table.
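A minimal sketch of the `pivot_wider` approach, assuming a recent tidyr is installed. The data frame name `x` and the second grouping column `Site` are assumptions introduced for illustration:

```r
library(tidyr)

# Example data from the question; the object name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# One grouping variable: a single row with one summed column per category
wide <- pivot_wider(x, names_from = Category, values_from = Frequency,
                    values_fn = sum)

# Two grouping variables: `Site` is a hypothetical second grouping column
x2 <- cbind(x, Site = c("A", "B", "A", "A", "B", "A", "B"))
tab <- pivot_wider(x2, names_from = Category, values_from = Frequency,
                   values_fn = sum)

print(tab)
```

In the two-variable case each `Site` becomes a row and each `Category` a column, which gives the summary table layout the answer describes.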