Joining aggregated values back to the original data frame
One of the design patterns I use over and over is performing a "group by" or "split, apply, combine (SAC)" on a data frame and then joining the aggregated data back to the original data. This is useful, for example, when calculating each county's deviation from the state mean in a data frame with many states and counties. Rarely is my aggregate calculation only a simple mean, but it makes a good example. I often solve this problem the following way:
library(plyr)
set.seed(1)
## set up some data
group1 <- rep(1:3, 4)
group2 <- sample(c("A", "B", "C"), 12, replace = TRUE)
values <- rnorm(12)
df <- data.frame(group1, group2, values)
## got some data, so let's aggregate
group1Mean <- ddply(df, "group1", function(x)
  data.frame(meanValue = mean(x$values)))
## merge() joins on the shared column, group1
df <- merge(df, group1Mean)
df
Which produces nice aggregate data like the following:
> df
   group1 group2   values meanValue
1       1      A  0.48743 -0.121033
2       1      A -0.04493 -0.121033
3       1      C -0.62124 -0.121033
4       1      C -0.30539 -0.121033
5       2      A  1.51178  0.004804
6       2      B  0.73832  0.004804
7       2      A -0.01619  0.004804
8       2      B -2.21470  0.004804
9       3      B  1.12493  0.758598
10      3      C  0.38984  0.758598
11      3      B  0.57578  0.758598
12      3      A  0.94384  0.758598
This works, but are there alternative ways of doing this that improve on readability, performance, etc.?
Answers (5)
One line of code does the trick:
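The code for this answer is not shown here. A minimal sketch of what such a one-liner could look like, assuming plyr's ddply() is combined with transform() (this pairing is an assumption, not necessarily the answerer's exact call), with the question's data rebuilt so the block runs on its own:

```r
library(plyr)

# rebuild the example data from the question
set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

# transform() is applied to each group1 subset, so mean(values)
# is the mean of that group and is recycled across its rows
df2 <- ddply(df, "group1", transform, meanValue = mean(values))
df2
```

Because ddply() stacks the per-group pieces back together, the result comes out ordered by group1, and no separate merge() step is needed.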
I think ave() is more useful here than the plyr call you show (I'm not familiar enough with plyr to know if you can do what you want with plyr directly or not, I would be surprised if you can't!) or the other base R alternatives (aggregate(), tapply()):
You can use within() or transform() to embed this result directly into df:
And if the ordering is important:
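The code for these three steps is not shown here. A minimal base R sketch of them, rebuilding the question's data; the variable names groupMeans, df2 and df3 are illustrative assumptions:

```r
# rebuild the example data from the question
set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

# ave() returns a vector as long as values, holding each row's group mean
groupMeans <- with(df, ave(values, group1, FUN = mean))

# embed the result as a new column; both forms keep the original row order
df2 <- within(df, meanValue <- ave(values, group1, FUN = mean))
df3 <- transform(df, meanValue = ave(values, group1, FUN = mean))

# sort by group1 if the ordering matters (merge() in the question sorts this way)
df2 <- df2[order(df2$group1), ]
df2
```

FUN has to be passed by name, because ave() treats every unnamed argument after the first as a grouping variable.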
In terms of performance, you can do this same kind of operation using the data.table package, which has built-in aggregation and is very fast thanks to indices and a C-based implementation. For instance, given df already exists from your example:
I have not benchmarked it, but in my experience it is a lot faster.
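The data.table code for this answer is not shown here. A sketch of how the aggregation could be written, assuming a keyed table and a list() expression in j (the names dt and dtMean are illustrative):

```r
library(data.table)

# rebuild the example data from the question
set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

dt <- data.table(df)
setkey(dt, group1)  # sorts by group1 and marks it as the key

# within each group, mean(values) is a single number that is recycled
# alongside the existing columns listed next to it
dtMean <- dt[, list(group2, values, meanValue = mean(values)), by = group1]
dtMean
```

The existing columns have to be listed explicitly in j to appear next to the aggregate.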
If you decide to go down the data.table road, which I think is worth exploring if you work with large data sets, you really need to read the docs, because there are some differences from data frames that can bite you if you are unaware of them. Notably, however, a data.table generally does work with any function expecting a data frame, as a data.table will claim its type is data frame (data.table inherits from data.frame).
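The inheritance point can be checked directly. A small sketch, where groupMeansOf is a hypothetical helper written as if it only knew about data frames:

```r
library(data.table)

dt <- data.table(group1 = rep(1:3, 4), values = as.numeric(1:12))

class(dt)          # includes both "data.table" and "data.frame"
is.data.frame(dt)  # TRUE, because of the inherited class

# hypothetical helper that expects a plain data frame
groupMeansOf <- function(d) {
  stopifnot(is.data.frame(d))
  tapply(d$values, d$group1, mean)
}

res <- groupMeansOf(dt)
res
```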
[ Feb 2011 ]
[ Aug 2012 ] Update from Matthew:
New in v1.8.2, released to CRAN in July 2012, is := by group. This is very similar to the answer above, but adds the new column by reference to dt, so there is no copy and no need for a merge step or relisting existing columns to return alongside the aggregate. There is no need to setkey first, and it copes with non-contiguous groups (i.e. groups that aren't grouped together). This is significantly faster for large datasets, and has a simple and short syntax:
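The code for this update is not shown here. A minimal sketch of := by group on the question's data, with dt as an illustrative name:

```r
library(data.table)

# rebuild the example data from the question
set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

dt <- data.table(df)

# adds meanValue to dt in place; no key is set and the rows stay in
# their original order even though the groups are interleaved
dt[, meanValue := mean(values), by = group1]
dt
```

Since the column is added by reference, the result does not need to be assigned back to dt.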
Can't you just add x to the function you pass to ddply?
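A sketch of that suggestion, assuming x is passed as the first argument to data.frame() inside the anonymous function so the original columns are returned next to the group mean:

```r
library(plyr)

# rebuild the example data from the question
set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

# x is the subset of df for one group; the scalar mean is recycled
# across its rows, so no merge() is needed afterwards
df2 <- ddply(df, "group1", function(x)
  data.frame(x, meanValue = mean(x$values)))
df2
```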
A dplyr possibility:
This returns the data frame in the original order. Add arrange(group1) to the pipe if you wish to order by "group1".
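The pipe itself is not shown here. A sketch of what it could look like, assuming group_by() followed by mutate(); the trailing ungroup() is an addition of this sketch and not something the answer mentions:

```r
library(dplyr)

# rebuild the example data from the question
set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

# mutate() on a grouped data frame computes mean(values) per group
# and keeps every row in its original position
df2 <- df %>%
  group_by(group1) %>%
  mutate(meanValue = mean(values)) %>%
  ungroup()

# the same pipe with arrange(group1) added to sort by group
df3 <- df2 %>% arrange(group1)
df3
```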