Fast/elegant way to construct a mean/variance summary table
I can achieve this task, but I feel like there must be a "best" (slickest, most compact, clearest-code, fastest?) way of doing it and have not figured it out so far ...
For a specified set of categorical factors I want to construct a table of means and variances by group.
generate data:
set.seed(1001)
d <- expand.grid(f1=LETTERS[1:3],f2=letters[1:3],
f3=factor(as.character(as.roman(1:3))),rep=1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
desired output:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.09537958
2 A a II 0.4876630 0.11079670
3 A a III 0.3102926 0.20280568
4 A b I 0.3914084 0.05869310
5 A b II 0.5257355 0.21863126
6 A b III 0.3356860 0.07943314
... etc. ...
using aggregate/merge:
library(reshape)
m1 <- aggregate(y~f1*f2*f3,data=d,FUN=mean)
m2 <- aggregate(y~f1*f2*f3,data=d,FUN=var)
mvtab <- merge(rename(m1,c(y="y.mean")),
rename(m2,c(y="y.var")))
using ddply/summarise (possibly best, but I haven't been able to make it work):
mvtab2 <- ddply(subset(d,select=-c(z,rep)),
.(f1,f2,f3),
summarise,numcolwise(mean),numcolwise(var))
results in
Error in output[[var]][rng] <- df[[var]] :
incompatible types (from closure to logical) in subassignment type fix
using melt/cast (maybe best?):
mvtab3 <- cast(melt(subset(d,select=-c(z,rep)),
id.vars=1:3),
...~.,fun.aggregate=c(mean,var))
## now have to drop "variable"
mvtab3 <- subset(mvtab3,select=-variable)
## also should rename response variables
Won't (?) work in reshape2. Explaining ...~. to someone could be tricky!
I'm a bit puzzled. Does this not work:
This gives me something like this:
Which is in the right form, but it looks like the values are different from what you specified.
Edit
Here's how to make your version with numcolwise work:
You forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece is called piece internally. (Which Hadley points out in the comments shouldn't be relied upon, as it may change in future versions of plyr.)
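The code for this answer did not survive in this copy of the page. As a minimal sketch of the ddply/summarise approach it discusses, assuming the plyr package is installed and using the question's generated data, naming each summary column explicitly avoids the numcolwise problem altogether. The object name mvtab2 is taken from the question; everything else is ordinary plyr usage, not necessarily the answerer's exact code.

```r
library(plyr)

# Recreate the question's data
set.seed(1001)
d <- expand.grid(f1 = LETTERS[1:3], f2 = letters[1:3],
                 f3 = factor(as.character(as.roman(1:3))), rep = 1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

# One row per f1/f2/f3 combination, with named summary columns
mvtab2 <- ddply(d, .(f1, f2, f3), summarise,
                y.mean = mean(y), y.var = var(y))
head(mvtab2)
```

Because the summaries are named directly, no renaming or merging step is needed afterwards.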
(I voted for Joshua's.) Here's an Hmisc::summary.formula solution. The advantage of this for me is that it is well integrated with the Hmisc::latex output "channel".
Snipped output to show the latex -> PDF -> png output:
@joran is spot-on with the ddply answer. Here's how I would do it with aggregate. Note that I avoid the formula interface (it is slower).
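The original code for this answer is missing here. The following is a sketch of what a non-formula aggregate call could look like in base R, not necessarily the answerer's exact code. It relies on the fact that when FUN returns a named vector, aggregate stores the results in a matrix column, which then has to be flattened; the names agg and mvtab are illustrative.

```r
# Recreate the question's data
set.seed(1001)
d <- expand.grid(f1 = LETTERS[1:3], f2 = letters[1:3],
                 f3 = factor(as.character(as.roman(1:3))), rep = 1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

# Non-formula interface: pass the data column and the grouping columns directly
agg <- aggregate(d["y"], by = d[c("f1", "f2", "f3")],
                 FUN = function(x) c(mean = mean(x), var = var(x)))

# agg$y is a two-column matrix ("mean", "var"); split it into ordinary columns
mvtab <- cbind(agg[c("f1", "f2", "f3")],
               y.mean = agg$y[, "mean"], y.var = agg$y[, "var"])
head(mvtab)
```

Computing both statistics in one pass avoids the separate aggregate calls and the merge used in the question.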
I'm slightly addicted to speed comparisons even though they're largely irrelevant for me in this situation ...
aggregate is fastest (even faster than data.table, which is a surprise to me, although things might be different with a bigger table to aggregate), even using the formula interface ...
(Now I just need Dirk to step up and post an Rcpp solution that is 1000 times faster than anything else ...)
I find the doBy package has some very convenient functions for things like this. For example, the function ?summaryBy is quite handy. Consider:
So the function call is simple, easy to use, and I would say, elegant.
Now, if your primary concern is speed, it seems that it would be reasonable, at least with smaller sized tasks (note that I couldn't get the ramnath_datatable function to work for whatever reason):
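The summaryBy call this answer refers to is not preserved in this copy. As a rough sketch, assuming the doBy package is installed, a call of the following shape produces the grouped mean and variance; the result name mvtab_doby is illustrative.

```r
library(doBy)

# Recreate the question's data
set.seed(1001)
d <- expand.grid(f1 = LETTERS[1:3], f2 = letters[1:3],
                 f3 = factor(as.character(as.roman(1:3))), rep = 1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

# Response on the left, grouping factors on the right;
# output columns are named from the response and each function
mvtab_doby <- summaryBy(y ~ f1 + f2 + f3, data = d, FUN = c(mean, var))
head(mvtab_doby)
```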
I came across this question and found that the benchmarks were done with small tables, so it's hard to tell which method is better with only 100 rows.
I also modified the data a bit to make it "unsorted", which would be the more common case, for example when the data comes from a DB.
I added a few more data.table trials to see whether setting a key beforehand is faster. Here, setting the key beforehand doesn't seem to improve performance much, so ramnath's solution seems to be the fastest.
And here is a solution using Hadley Wickham's new dplyr library.
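The dplyr code itself is missing from this copy. A minimal sketch of the usual group_by/summarise pattern follows, assuming dplyr 1.0.0 or later (the .groups argument does not exist in older versions); the result name mvtab_dplyr is illustrative and this is not necessarily the answerer's exact code.

```r
library(dplyr)

# Recreate the question's data
set.seed(1001)
d <- expand.grid(f1 = LETTERS[1:3], f2 = letters[1:3],
                 f3 = factor(as.character(as.roman(1:3))), rep = 1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

# Group by the three factors, then compute both summaries per group
mvtab_dplyr <- d %>%
  group_by(f1, f2, f3) %>%
  summarise(y.mean = mean(y), y.var = var(y), .groups = "drop")
head(mvtab_dplyr)
```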
Here is a solution using data.table.
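The data.table code is not preserved in this copy either. A sketch of the standard grouped-summary idiom, assuming the data.table package is installed, could look like this; the names dt and mvtab_dt are illustrative.

```r
library(data.table)

# Recreate the question's data
set.seed(1001)
d <- expand.grid(f1 = LETTERS[1:3], f2 = letters[1:3],
                 f3 = factor(as.character(as.roman(1:3))), rep = 1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

# Convert once, then compute both summaries per f1/f2/f3 group
dt <- as.data.table(d)
mvtab_dt <- dt[, .(y.mean = mean(y), y.var = var(y)), by = .(f1, f2, f3)]
head(mvtab_dt)
```

Rows come back in order of first appearance of each group; calling setkey(dt, f1, f2, f3) beforehand would sort them by the grouping columns.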