How do I speed up summarise and ddply?
I have a data frame with 2 million rows, and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors), and get the weighted mean of 3 columns (with weights defined by my data set). The following is reasonably quick:
system.time(a2 <- aggregate(cbind(col1,col2,col3) ~ fac1 + fac2 + fac3, data=aggdf, FUN=mean))
   user  system elapsed
 91.358   4.747 115.727
The problem is that I want to use weighted.mean instead of mean to calculate my aggregate columns.
If I try the following ddply on the same data frame (note, I cast to immutable), the following does not finish after 20 minutes:
x <- ddply(idata.frame(aggdf),
           c("fac1","fac2","fac3"),
           summarise,
           w=sum(w),
           col1=weighted.mean(col1, w),
           col2=weighted.mean(col2, w),
           col3=weighted.mean(col3, w))
This operation seems to be CPU hungry, but not very RAM-intensive.
EDIT:
So I ended up writing this little function, which "cheats" a bit by taking advantage of some properties of weighted mean and does a multiplication and a division on the whole object, rather than on the slices.
weighted_mean_cols <- function(df, bycols, aggcols, weightcol) {
  df[,aggcols] <- df[,aggcols]*df[,weightcol]
  df <- aggregate(df[,c(weightcol, aggcols)], by=as.list(df[,bycols]), sum)
  df[,aggcols] <- df[,aggcols]/df[,weightcol]
  df
}
When I run it as:
a2 <- weighted_mean_cols(aggdf, c("fac1","fac2","fac3"), c("col1","col2","col3"),"w")
I get good performance, and somewhat reusable, elegant code.
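For reference, the property the function relies on is weighted.mean(x, w) == sum(x * w) / sum(w), so summing the pre-multiplied columns per group and then dividing by the summed weights reproduces the per-group weighted means. A quick sanity check on made-up values (purely illustrative, not data from the question):

# toy vectors, only to illustrate the identity used by weighted_mean_cols
x <- c(1, 2, 3)
w <- c(0.5, 1, 2)
all.equal(weighted.mean(x, w), sum(x * w) / sum(w))  # TRUE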
2 Answers
Though ddply is hard to beat for elegance and ease of code, I find that for big data, tapply is much faster. In your case, I would use a …

Sorry for the dots and possibly faulty understanding of the question; but I am in a bit of a rush and must catch a bus in about minus five minutes!
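As a rough sketch of that tapply approach (assuming the question's column names fac1..fac3, col1 and weight column w in aggdf; this is illustrative, not the answerer's original code):

# sketch: grouped weighted mean for one column via tapply;
# interaction() builds a single key from the three factors,
# repeat for col2/col3 or loop over the column names
key <- interaction(aggdf$fac1, aggdf$fac2, aggdf$fac3, drop = TRUE)
wm_col1 <- tapply(aggdf$col1 * aggdf$w, key, sum) / tapply(aggdf$w, key, sum)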
If you're going to use your edit, why not use rowsum and save yourself a few minutes of execution time?
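A sketch of how rowsum could stand in for aggregate inside the edit's function (same column-name assumptions as the question; not the answerer's code):

# sketch: rowsum sums the rows of a data frame by a grouping key in compiled code,
# which is why it tends to be faster than aggregate for this pattern
weighted_mean_cols_rowsum <- function(df, bycols, aggcols, weightcol) {
  df[, aggcols] <- df[, aggcols] * df[, weightcol]
  key <- interaction(df[, bycols], drop = TRUE)     # one factor combining the by-columns
  out <- rowsum(df[, c(weightcol, aggcols)], key)   # group-wise sums
  out[, aggcols] <- out[, aggcols] / out[, weightcol]
  out                                               # group labels end up in rownames(out)
}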