How do I speed up summarise and ddply?
I have a data frame with 2 million rows, and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors), and get the weighted mean of 3 columns (with weights defined by my data set). The following is reasonably quick:
system.time(a2 <- aggregate(cbind(col1,col2,col3) ~ fac1 + fac2 + fac3, data=aggdf, FUN=mean))
   user  system elapsed
 91.358   4.747 115.727
The problem is that I want to use weighted.mean instead of mean to calculate my aggregate columns.
If I try the following ddply on the same data frame (note, I cast to immutable), the following does not finish after 20 minutes:
x <- ddply(idata.frame(aggdf),
           c("fac1","fac2","fac3"),
           summarise,
           w=sum(w),
           col1=weighted.mean(col1, w),
           col2=weighted.mean(col2, w),
           col3=weighted.mean(col3, w))
This operation seems to be CPU hungry, but not very RAM-intensive.
EDIT:
So I ended up writing this little function, which "cheats" a bit by taking advantage of some properties of weighted mean and does a multiplication and a division on the whole object, rather than on the slices.
weighted_mean_cols <- function(df, bycols, aggcols, weightcol) {
  df[,aggcols] <- df[,aggcols]*df[,weightcol]
  df <- aggregate(df[,c(weightcol, aggcols)], by=as.list(df[,bycols]), sum)
  df[,aggcols] <- df[,aggcols]/df[,weightcol]
  df
}
When I run it as:
a2 <- weighted_mean_cols(aggdf, c("fac1","fac2","fac3"), c("col1","col2","col3"),"w")
I get good performance, and somewhat reusable, elegant code.
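For reference, the property the function relies on is weighted.mean(x, w) == sum(x * w) / sum(w), so summing the pre-multiplied columns per group and then dividing by the summed weights reproduces the per-group weighted means. A quick sanity check on made-up values (purely illustrative, not data from the question):

# toy vectors, only to illustrate the identity used by weighted_mean_cols
x <- c(1, 2, 3)
w <- c(0.5, 1, 2)
all.equal(weighted.mean(x, w), sum(x * w) / sum(w))  # TRUE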
2 Answers
Though ddply is hard to beat for elegance and ease of code, I find that for big data, tapply is much faster. In your case, I would use a …

Sorry for the dots and possibly faulty understanding of the question; but I am in a bit of a rush and must catch a bus in about minus five minutes!
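As a rough sketch of that tapply approach (assuming the question's column names fac1..fac3, col1 and weight column w in aggdf; this is illustrative, not the answerer's original code):

# sketch: grouped weighted mean for one column via tapply;
# interaction() builds a single key from the three factors,
# repeat for col2/col3 or loop over the column names
key <- interaction(aggdf$fac1, aggdf$fac2, aggdf$fac3, drop = TRUE)
wm_col1 <- tapply(aggdf$col1 * aggdf$w, key, sum) / tapply(aggdf$w, key, sum)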
If you're going to use your edit, why not use rowsum and save yourself a few minutes of execution time?
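A sketch of how rowsum could stand in for aggregate inside the edit's function (same column-name assumptions as the question; not the answerer's code):

# sketch: rowsum sums the rows of a data frame by a grouping key in compiled code,
# which is why it tends to be faster than aggregate for this pattern
weighted_mean_cols_rowsum <- function(df, bycols, aggcols, weightcol) {
  df[, aggcols] <- df[, aggcols] * df[, weightcol]
  key <- interaction(df[, bycols], drop = TRUE)     # one factor combining the by-columns
  out <- rowsum(df[, c(weightcol, aggcols)], key)   # group-wise sums
  out[, aggcols] <- out[, aggcols] / out[, weightcol]
  out                                               # group labels end up in rownames(out)
}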