ddply returns too many results
For some reason I'm getting more results than I expected since upgrading to R 2.13.0 and to plyr 1.5.1 (plyr_1.5.1.tar.gz). I had previously run this on an older version of plyr (unfortunately I'm not sure which version, as I've just overwritten it).
library(plyr)
dd <-data.frame(matrix(rnorm(216),72,3),c(rep("A",24),rep("B",24),
rep("C",24)),c(rep("J",36),rep("K",36)))
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")
results1 <- ddply(dd, c("dim1","dim2"), function(df) c(m1=mean(df$v1)) )
results2 <- ddply(dd, c("dim1","dim2"), function(df) { c(m1=mean(df$v1),
m2=mean(df$v2)) } )
results3 <- ddply(dd, c("dim1","dim2"), function(df) { c(m1=mean(df$v1),
m2=mean(df$v2), m3=mean(df$v3)) } )
I don't understand why results2 has twice as many rows as results1, and results3 has three times as many; the same four rows are simply repeated two or three times.
I had a handy copy of R version 2.11.0 Patched (2010-05-01 r51907) with an old version of plyr, and the results I was expecting from it were:
> results1
dim1 dim2 m1
1 A J 0.07312783
2 B J -0.22428746
3 B K -0.44205832
4 C K 0.21421456
> results2
dim1 dim2 m1 m2
1 A J 0.07312783 -0.1130148
2 B J -0.22428746 0.4394832
3 B K -0.44205832 -0.1934018
4 C K 0.21421456 -0.0178809
> results3
dim1 dim2 m1 m2 m3
1 A J 0.07312783 -0.1130148 -0.03175873
2 B J -0.22428746 0.4394832 0.21581696
3 B K -0.44205832 -0.1934018 -0.28313530
4 C K 0.21421456 -0.0178809 -0.21948430
The results I get from R version 2.13.0 (2011-04-13):
> results1
dim1 dim2 m1
1 A J -0.2270726
2 B J 0.5860493
3 B K -0.5986129
4 C K 0.3135809
> results2
dim1 dim2 m1 m2
1 A J -0.2270726 -0.19037813
2 B J 0.5860493 -0.05385395
3 B K -0.5986129 0.29404095
4 C K 0.3135809 -0.26744010
5 A J -0.2270726 -0.19037813
6 B J 0.5860493 -0.05385395
7 B K -0.5986129 0.29404095
8 C K 0.3135809 -0.26744010
> results3
dim1 dim2 m1 m2 m3
1 A J -0.2270726 -0.19037813 -0.20448734
2 B J 0.5860493 -0.05385395 -0.11190857
3 B K -0.5986129 0.29404095 -0.27072101
4 C K 0.3135809 -0.26744010 -0.03184949
5 A J -0.2270726 -0.19037813 -0.20448734
6 B J 0.5860493 -0.05385395 -0.11190857
7 B K -0.5986129 0.29404095 -0.27072101
8 C K 0.3135809 -0.26744010 -0.03184949
9 A J -0.2270726 -0.19037813 -0.20448734
10 B J 0.5860493 -0.05385395 -0.11190857
11 B K -0.5986129 0.29404095 -0.27072101
12 C K 0.3135809 -0.26744010 -0.03184949
Why does results2 have 8 rows instead of 4, and results3 12 rows instead of 4?
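For reference, four rows is the expected count because dd only contains four distinct dim1/dim2 combinations. A minimal base-R check of that, rebuilding just the grouping columns from the code above (the names `groups`, `distinct_groups` and `group_sizes` are only for illustration):

```r
# Rebuild the two grouping columns exactly as they are defined in dd.
dim1 <- c(rep("A", 24), rep("B", 24), rep("C", 24))
dim2 <- c(rep("J", 36), rep("K", 36))
groups <- data.frame(dim1, dim2)

# Distinct (dim1, dim2) pairs: one output row is expected per pair.
distinct_groups <- unique(groups)
nrow(distinct_groups)

# Number of input rows in each non-empty group.
group_sizes <- as.data.frame(table(groups), stringsAsFactors = FALSE)
group_sizes[group_sizes$Freq > 0, ]
```

The combinations A/K and C/J never occur in the data, which is why they are filtered out of the size table.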
Thanks,
Sean
Comments (2)
This will be fixed shortly in plyr 1.5.2.
It's the c() function inside your ddply() that's causing the problem.
Here are three alternative ways that you can write your statement for results3, progressively getting simpler:
Use data.frame inside your function:
ddply(dd, c("dim1","dim2"), function(df) {data.frame(m1=mean(df$v1),
m2=mean(df$v2), m3=mean(df$v3)) } )
Use summarise:
ddply(dd, .(dim1, dim2), summarise, m1=mean(v1), m2=mean(v2), m3=mean(v3))
Use numcolwise:
ddply(dd, .(dim1, dim2), numcolwise(mean))
In each case the result is what you would expect: one row per dim1/dim2 combination.
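A self-contained sketch that runs the three alternatives side by side and compares their row counts with the number of groups. The `set.seed(1)` call and the names `by_df`, `by_sum`, `by_num` and `n_groups` are my own additions for illustration; the data frame is built as in the question. Note that numcolwise keeps the original column names (v1, v2, v3) rather than m1, m2, m3.

```r
library(plyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
dd <- data.frame(
  matrix(rnorm(216), 72, 3),
  c(rep("A", 24), rep("B", 24), rep("C", 24)),
  c(rep("J", 36), rep("K", 36))
)
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")

# 1. Return a one-row data.frame from the function instead of c(...).
by_df <- ddply(dd, c("dim1", "dim2"), function(df) {
  data.frame(m1 = mean(df$v1), m2 = mean(df$v2), m3 = mean(df$v3))
})

# 2. Let summarise build the one-row result for each group.
by_sum <- ddply(dd, .(dim1, dim2), summarise,
                m1 = mean(v1), m2 = mean(v2), m3 = mean(v3))

# 3. Apply mean to every numeric column of each group.
by_num <- ddply(dd, .(dim1, dim2), numcolwise(mean))

# Each result should have one row per distinct dim1/dim2 combination.
n_groups <- nrow(unique(dd[c("dim1", "dim2")]))
c(nrow(by_df), nrow(by_sum), nrow(by_num)) == n_groups
```

If any element of the final comparison is FALSE on your installation, the row duplication described in the question is still present for that form.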