Elegant way to solve a ddply task with aggregate (hoping for better performance)

Posted on 2024-12-21 04:53:29

I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:

  chromosome probeset               ensg symbol    XXA_00    XXA_36    XXB_00
1          X  4938842 ENSMUSG00000000003   Pbsn  4.796123  4.737717  5.326664

I want to compute the mean of each numeric column over the rows that share the same ensg value. The problem is that I would like to leave the other identity variables, chromosome and symbol, untouched, since they are also identical for rows with the same ensg.

In the end I would like to have a data.frame with the identity columns chromosome, ensg and symbol, plus the mean of each numeric column over the rows with the same identifier. I implemented this with ddply, but it is very slow compared to aggregate:

library(plyr)

# numeric.columns holds the indices of the numeric columns
spec.mean <- function(eset.piece) {
  cbind(eset.piece[1, -numeric.columns],
        t(colMeans(eset.piece[, numeric.columns])))
}

mean.eset <- ddply(eset.consensus.grand, .(ensg), spec.mean, .progress = "tk")

My first aggregate implementation looks like this,

mean.eset <- aggregate(eset[, numeric.columns], by = list(eset$ensg), FUN = mean, na.rm = TRUE)

and is much faster. But the problem with aggregate is that I have to reattach the descriptive variables afterwards. I have not figured out how to use my custom function with aggregate, since aggregate passes only vectors to FUN, not data frames.

Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?

执笏见 2024-12-28 04:53:29

If speed is a primary concern, you should take a look at the data.table package. When the number of rows or grouping columns is large, data.table really seems to shine. The package's wiki has several links to other good introductory documents.

Here's how you'd do this aggregation with data.table:

library(data.table)
# Turn the data.frame above into a data.table
dt <- data.table(df)
# Aggregation: plain mean() is used here; recent data.table versions
# optimize it by group, so the old .Internal(mean(x)) workaround is not needed
dt[, list(XXA_00 = mean(XXA_00),
          XXA_36 = mean(XXA_36),
          XXB_00 = mean(XXB_00)),
   by = c("ensg", "chromosome", "symbol")]

This gives us:

     ensg chromosome symbol      XXA_00      XXA_36    XXB_00
[1,]   E1          A     S1  0.18026869  0.13118997 0.6558433
[2,]   E2          B     S2 -0.48830539  0.24235537 0.5971377
[3,]   E3          C     S3 -0.04786984 -0.03139901 0.5618208
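Listing every numeric column by hand gets tedious once there are many of them. A minimal sketch of one way around that, using data.table's .SD and .SDcols; the toy data, the seed, and the names id.cols, num.cols and dt.mean are assumptions made for illustration, not part of the original answer:

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(chromosome = gl(3, 10, labels = c("A", "B", "C")),
                 probeset   = gl(3, 10, labels = c("X", "Y", "Z")),
                 ensg       = gl(3, 10, labels = c("E1", "E2", "E3")),
                 symbol     = gl(3, 10, labels = c("S1", "S2", "S3")),
                 XXA_00 = rnorm(30), XXA_36 = rnorm(30), XXB_00 = rnorm(30))
dt <- as.data.table(df)

# Hypothetical helper vectors: identifier columns and columns to average
id.cols  <- c("ensg", "chromosome", "symbol")
num.cols <- c("XXA_00", "XXA_36", "XXB_00")

# .SD is the per-group subset restricted to .SDcols; lapply() averages each column
dt.mean <- dt[, lapply(.SD, mean, na.rm = TRUE), by = id.cols, .SDcols = num.cols]
print(dt.mean)
```

Grouping by all three identifier columns keeps chromosome and symbol in the result without a separate merge step, which works here because they are constant within each ensg.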

Judging by output from the rbenchmark package, the aggregate solution from the other answer fares pretty well on the 30-row data.frame. However, when the data.frame contains 3e5 rows, data.table pulls away as the clear winner. Here's the output:

 benchmark(fag(), fdt(), replications = 10)
   test replications elapsed relative user.self sys.self 
1 fag()           10   12.71 23.98113     12.40     0.31     
2 fdt()           10    0.53  1.00000      0.48     0.05
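The definitions of fag() and fdt() are not shown in the answer. The sketch below is one guess at what such wrappers might look like, timed with base R's system.time rather than rbenchmark. The data size, the number of groups, and the way the test data is generated are all assumptions, so the timings will not match the figures above:

```r
library(data.table)

set.seed(1)           # arbitrary seed for reproducibility
n        <- 3e4       # assumed size, smaller than the 3e5 rows in the answer
n.groups <- 1000      # assumed number of distinct ensg values
g <- sample(n.groups, n, replace = TRUE)

# chromosome and symbol are functions of g, so they are constant within each ensg
big <- data.frame(ensg = paste0("E", g),
                  chromosome = paste0("C", g %% 20),
                  symbol = paste0("S", g),
                  XXA_00 = rnorm(n), XXA_36 = rnorm(n), XXB_00 = rnorm(n))
big.dt <- as.data.table(big)

# Hypothetical wrappers around the two solutions
fag <- function() aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
                            data = big, FUN = mean)
fdt <- function() big.dt[, list(XXA_00 = mean(XXA_00),
                                XXA_36 = mean(XXA_36),
                                XXB_00 = mean(XXB_00)),
                         by = c("ensg", "chromosome", "symbol")]

print(system.time(res.ag <- fag()))
print(system.time(res.dt <- fdt()))

# Put both results in the same row order so they can be compared
res.ag <- res.ag[order(as.character(res.ag$ensg)), ]
res.dt <- as.data.frame(res.dt)
res.dt <- res.dt[order(as.character(res.dt$ensg)), ]
```

Checking that both functions return the same means before comparing their timings guards against benchmarking two computations that are not actually equivalent.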

风吹雨成花 2024-12-28 04:53:29

First let's define a toy example:

df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
                 probeset = gl(3, 10, labels = c('X', 'Y', 'Z')),
                 ensg = gl(3, 10, labels = c('E1', 'E2', 'E3')),
                 symbol = gl(3, 10, labels = c('S1', 'S2', 'S3')),
                 XXA_00 = rnorm(30),
                 XXA_36 = rnorm(30),
                 XXB_00 = rnorm(30))

And then we use aggregate with the formula interface:

df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
                 data = df, FUN = mean)

> df1
  ensg chromosome symbol      XXA_00      XXA_36      XXB_00
1   E1          A     S1 -0.02533499 -0.06150447 -0.01234508
2   E2          B     S2 -0.25165987  0.02494902 -0.01116426
3   E3          C     S3  0.09454154 -0.48468517 -0.25644569
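When there are many numeric columns, spelling them out inside cbind() becomes unwieldy. A minimal sketch of an alternative that uses the non-formula interface and passes the identifier columns as the by list; the seed and the names id.cols, num.cols and df2 are assumptions for illustration. Note that in the question's real data probeset is numeric, so there the columns to average would have to be chosen explicitly rather than with is.numeric:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(chromosome = gl(3, 10, labels = c("A", "B", "C")),
                 probeset   = gl(3, 10, labels = c("X", "Y", "Z")),
                 ensg       = gl(3, 10, labels = c("E1", "E2", "E3")),
                 symbol     = gl(3, 10, labels = c("S1", "S2", "S3")),
                 XXA_00 = rnorm(30), XXA_36 = rnorm(30), XXB_00 = rnorm(30))

# Hypothetical helper vectors: identifier columns and columns to average
id.cols  <- c("ensg", "chromosome", "symbol")
num.cols <- names(df)[sapply(df, is.numeric)]  # in this toy data probeset is a factor

# A data frame is a list, so df[id.cols] can be passed directly as `by`;
# the grouping columns keep their names in the result
df2 <- aggregate(df[num.cols], by = df[id.cols], FUN = mean, na.rm = TRUE)
print(df2)
```

Because chromosome and symbol are constant within each ensg, adding them to the grouping does not create extra groups; it only carries them into the output, so nothing has to be reattached afterwards.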
