Joining aggregated values back to the original data frame

Posted on 2024-10-18 01:30:58

One of the design patterns I use over and over is performing a "group by" or "split, apply, combine (SAC)" on a data frame and then joining the aggregated data back to the original data. This is useful, for example, when calculating each county's deviation from the state mean in a data frame with many states and counties. Rarely is my aggregate calculation only a simple mean, but it makes a good example. I often solve this problem the following way:

library(plyr)
set.seed(1)

## set up some data
group1 <- rep(1:3, 4)
group2 <- sample(c("A","B","C"), 12, replace = TRUE)
values <- rnorm(12)
df <- data.frame(group1, group2, values)

## got some data, so let's aggregate

group1Mean <- ddply( df, "group1", function(x) 
                     data.frame( meanValue = mean(x$values) ) )
df <- merge( df, group1Mean )
df

Which produces nice aggregate data like the following:

> df
   group1 group2   values meanValue
1       1      A  0.48743 -0.121033
2       1      A -0.04493 -0.121033
3       1      C -0.62124 -0.121033
4       1      C -0.30539 -0.121033
5       2      A  1.51178  0.004804
6       2      B  0.73832  0.004804
7       2      A -0.01619  0.004804
8       2      B -2.21470  0.004804
9       3      B  1.12493  0.758598
10      3      C  0.38984  0.758598
11      3      B  0.57578  0.758598
12      3      A  0.94384  0.758598

This works, but are there alternative ways of doing this which improve on readability, performance, etc?

Comments (5)

痴者 2024-10-25 01:30:58

One line of code does the trick:

new <- ddply( df, "group1", transform, numcolwise(mean))
new

group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
2       1      A -0.04493361 -0.121033381
3       1      C -0.62124058 -0.121033381
4       1      C -0.30538839 -0.121033381
5       2      A  1.51178117  0.004803931
6       2      B  0.73832471  0.004803931
7       2      A -0.01619026  0.004803931
8       2      B -2.21469989  0.004803931
9       3      B  1.12493092  0.758597929
10      3      C  0.38984324  0.758597929
11      3      B  0.57578135  0.758597929
12      3      A  0.94383621  0.758597929

identical(df, new)
[1] TRUE
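
For reference, the more usual plyr idiom names the new column explicitly inside transform(). The following is a minimal sketch, assuming plyr is installed; the small data frame is made up for illustration and is not the question's random data. Note that the identical() check above compares against a df that already carries meanValue from the question's merge() step, so when starting from the raw data the named form is probably the safer one-liner.

```r
library(plyr)

# Hand-made illustration data (hypothetical values, not the question's data)
df <- data.frame(group1 = rep(1:3, 2),
                 group2 = c("A", "B", "C", "C", "A", "B"),
                 values = c(1, 2, 3, 3, 6, 9))

# transform() runs once per group1 piece; the named argument adds
# meanValue, computed from that piece's own values, to every row
new <- ddply(df, "group1", transform, meanValue = mean(values))
new
```

As with the merge() approach, the result comes back ordered by group1 rather than in the original row order.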
生来就爱笑 2024-10-25 01:30:58

I think ave() is more useful here than the plyr call you show (I'm not familiar enough with plyr to know whether it can do what you want directly, though I would be surprised if it can't) or the other base R alternatives (aggregate(), tapply()):

> with(df, ave(values, group1, FUN = mean))
 [1] -0.121033381  0.004803931  0.758597929 -0.121033381  0.004803931
 [6]  0.758597929 -0.121033381  0.004803931  0.758597929 -0.121033381
[11]  0.004803931  0.758597929

You can use within() or transform() to embed this result directly into df:

> df2 <- within(df, meanValue <- ave(values, group1, FUN = mean))
> head(df2)
  group1 group2     values    meanValue
1      1      A  0.4874291 -0.121033381
2      2      B  0.7383247  0.004803931
3      3      B  0.5757814  0.758597929
4      1      C -0.3053884 -0.121033381
5      2      A  1.5117812  0.004803931
6      3      C  0.3898432  0.758597929
> df3 <- transform(df, meanValue = ave(values, group1, FUN = mean))
> all.equal(df2,df3)
[1] TRUE

And if the ordering is important:

> head(df2[order(df2$group1, df2$group2), ])
   group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
10      1      A -0.04493361 -0.121033381
4       1      C -0.30538839 -0.121033381
7       1      C -0.62124058 -0.121033381
5       2      A  1.51178117  0.004803931
11      2      A -0.01619026  0.004803931
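
Since the question notes that the aggregate is rarely a plain mean, here is a minimal base R sketch of the same ave() pattern with a different summary function and with the deviation from the group mean. The data and the column names groupMedian and deviation are made up for illustration.

```r
# Hand-made illustration data (hypothetical values)
df <- data.frame(group1 = rep(1:3, 3),
                 values = c(1, 2, 3, 2, 4, 6, 9, 12, 15))

# ave() returns a vector as long as its input, aligned row by row,
# so the result can be assigned straight into the data frame.
# FUN must be passed by name because it comes after `...`.
df$groupMedian <- ave(df$values, df$group1, FUN = median)

# FUN defaults to mean, so this is each row's deviation from its group mean
df$deviation <- df$values - ave(df$values, df$group1)
df
```

Because nothing is split and recombined, the rows stay in their original order and no merge step is needed.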
栀梦 2024-10-25 01:30:58

In terms of performance, you can do this same kind of operation using the data.table package, which has built-in aggregation and is very fast thanks to indices and a C-based implementation. For instance, given that df already exists from your example:

library("data.table")
dt<-as.data.table(df)
setkey(dt,group1)
dt<-dt[,list(group2,values,meanValue=mean(values)),by=group1]
dt
      group1 group2      values   meanValue
 [1,]      1      A  0.82122120  0.18810771
 [2,]      1      C  0.78213630  0.18810771
 [3,]      1      C  0.61982575  0.18810771
 [4,]      1      A -1.47075238  0.18810771
 [5,]      2      B  0.59390132  0.03354688
 [6,]      2      A  0.07456498  0.03354688
 [7,]      2      B -0.05612874  0.03354688
 [8,]      2      A -0.47815006  0.03354688
 [9,]      3      B  0.91897737 -0.20205707
[10,]      3      C -1.98935170 -0.20205707
[11,]      3      B -0.15579551 -0.20205707
[12,]      3      A  0.41794156 -0.20205707

I have not benchmarked it, but in my experience it is a lot faster.

If you decide to go down the data.table road, which I think is worth exploring if you work with large data sets, you really need to read the docs, because there are some differences from data frames that can bite you if you are unaware of them. Notably, however, a data.table generally works with any function expecting a data frame, because a data.table also reports its class as data frame (data.table inherits from data.frame).

[ Feb 2011 ]


[ Aug 2012 ] Update from Matthew:

New in v1.8.2 released to CRAN in July 2012 is := by group. This is very similar to the answer above, but adds the new column by reference to dt so there is no copy and no need for a merge step or relisting existing columns to return alongside the aggregate. There is no need to setkey first, and it copes with non-contiguous groups (i.e. groups that aren't grouped together).

This is significantly faster for large datasets, and has a simple and short syntax:

dt <- as.data.table(df)
dt[, meanValue := mean(values), by = group1]
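
A minimal sketch of how the := form might be extended to the deviation use case from the question, assuming data.table is installed. The data and the deviation column name are made up for illustration.

```r
library(data.table)

# Hand-made illustration data (hypothetical values)
df <- data.frame(group1 = rep(1:3, 3),
                 values = c(1, 2, 3, 2, 4, 6, 9, 12, 15))

dt <- as.data.table(df)

# := adds the column by reference; by = group1 evaluates mean() per group
dt[, meanValue := mean(values), by = group1]

# A second := can then use the new column row by row
dt[, deviation := values - meanValue]

# The rows keep their original order, and dt still passes as a data frame
dt
inherits(dt, "data.frame")
```

Note that dt is modified in place, so there is no need to assign the result back to dt.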
过度放纵 2024-10-25 01:30:58

Can't you just add x to the function you pass to ddply?

df <- ddply( df, "group1", function(x)
             data.frame( x, meanValue = mean(x$values) ) )
彩扇题诗 2024-10-25 01:30:58

A dplyr possibility:

library(dplyr)
df %>% 
  group_by(group1) %>%
  mutate(meanValue = mean(values))

This returns the data frame in the original order. Add arrange(group1) to the pipe if you wish to order by "group1".
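
A slightly fuller sketch of the same pipeline, assuming dplyr is installed. The data and the deviation column are made up for illustration. mutate() leaves the result grouped, so ungroup() is added here to stop later steps from being computed per group by accident.

```r
library(dplyr)

# Hand-made illustration data (hypothetical values)
df <- data.frame(group1 = rep(1:3, 3),
                 values = c(1, 2, 3, 2, 4, 6, 9, 12, 15))

result <- df %>%
  group_by(group1) %>%
  # columns created earlier in the same mutate() can be reused
  mutate(meanValue = mean(values),
         deviation = values - meanValue) %>%
  ungroup() %>%        # drop the grouping once the per-group work is done
  arrange(group1)      # optional: order by group1

result
```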
