For each group, summarise the means of all variables in a data frame (ddply? split?)
A week ago I would have done this manually: subset the data frame by group into new data frames, compute the mean of each variable for each one, then rbind. Very clunky...
Now I have learned about split and plyr, and I guess there must be an easier way using these tools. Please don't prove me wrong.
test_data <- data.frame(cbind(
var0 = rnorm(100),
var1 = rnorm(100,1),
var2 = rnorm(100,2),
var3 = rnorm(100,3),
var4 = rnorm(100,4),
group = sample(letters[1:10],100,replace=T),
year = sample(c(2007,2009),100, replace=T)))
test_data$var1 <- as.numeric(as.character(test_data$var1))
test_data$var2 <- as.numeric(as.character(test_data$var2))
test_data$var3 <- as.numeric(as.character(test_data$var3))
test_data$var4 <- as.numeric(as.character(test_data$var4))
I am toying with ddply, but I can't produce what I want, i.e. a table like this for each group:
group a |2007|2009|
________|____|____|
var1    | xx | xx |
var2    | xx | xx |
etc.    | etc| etc|
Maybe d_ply and some odfweave output would work too. Input is very much appreciated.
P.S. I notice that data.frame converts the rnorm values to factors in my data frame. How can I avoid this? I(rnorm(100)) doesn't work, so I have to convert to numeric as done above.
Comments (6)
Given the format you want for the result, the reshape package will be more efficient than plyr.
The result looks like this
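A minimal sketch of the kind of reshape call this answer points to, not necessarily the answerer's exact code. It assumes the reshape package is installed, and it uses a smaller, made-up version of test_data (two variables, arbitrary seed) built without cbind so the columns stay numeric.

```r
library(reshape)  # assumption: the reshape package is installed

set.seed(1)  # arbitrary seed, only for reproducibility
test_data <- data.frame(
  var1 = rnorm(100, 1),
  var2 = rnorm(100, 2),
  group = sample(letters[1:10], 100, replace = TRUE),
  year = sample(c(2007, 2009), 100, replace = TRUE)
)

# Long format: one row per (group, year, variable, value)
molten <- melt(test_data, id.vars = c("group", "year"))

# Wide again: one row per group/variable, one column of means per year
result <- cast(molten, group + variable ~ year, fun.aggregate = mean)
head(result)
```

Everything ends up in one big table keyed by group and variable, which is close to the requested layout without being split into one table per group.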
You can do this with by(). First set up some data:
Use by():
You still need to reformat this for your table, but it does give you the gist of your answer in one line.
Edit: Further processing can be had via
You can play with the dimnames some more:
Edit: And with that, we can create a data.frame for the result using only base R, without resorting to external packages:
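A rough sketch of the by() approach in base R, under a few assumptions: the data frame testdf, its three variables and the seed are made up for illustration, and colMeans is used as the summary function because calling mean() directly on a data frame no longer works in current R.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
testdf <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100, 2),
  var3 = rnorm(100, 3),
  group = factor(sample(letters[1:10], 100, replace = TRUE)),
  year = factor(sample(c(2007, 2009), 100, replace = TRUE))
)

# One cell per group/year combination that actually occurs in the data
cell <- interaction(testdf$group, testdf$year, drop = TRUE)

# The one-liner: column means of var1..var3 within each cell
res <- by(testdf[, 1:3], cell, colMeans)

# Stack the per-cell vectors into a matrix, then wrap it in a data.frame
means <- do.call(rbind, res)
means_df <- data.frame(cell = levels(cell), means, row.names = NULL)
head(means_df)
```

The interaction() call is only a convenience for getting a single index; passing a list of the two factors to by() also works, but then the result is a two-dimensional array that takes more effort to flatten.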
EDIT: I wrote the following and then realized that Thierry had already written up almost EXACTLY the same answer. I somehow overlooked his answer. So if you like this answer, vote his up instead. I'm going ahead and posting since I spent the time typing it up.
This sort of stuff consumes way more of my time than I wish it did! Here's a solution using the reshape package by Hadley Wickham. This example does not do exactly what you asked because the results are all in one big table, not a table for each group.
The trouble you were having with the numeric values showing up as factors was because you were using cbind and everything was getting slammed into a matrix of type character. The cool thing is you don't need cbind with data.frame.
and this results in the following:
I wrote a blog post recently about doing something similar with plyr. I should do a part 2 about how to do the same thing using the reshape package. Both plyr and reshape were written by Hadley Wickham and are crazy useful tools.
It can be done with base R functions:
Explanations:
In R 2.9.2 the result is:
With my random data there is a problem with group "a": only 2007 cases are present. If year is a factor (with levels 2007 and 2009), the results may look better (you will get an entry for each of the two years, though some values will probably be NA).
The result is a list, so you can use lapply to, e.g., convert each element to a LaTeX table or an HTML table, print it transposed on screen, etc.
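One way this could look with base R only, as a sketch rather than the answer's original code. The data frame is a cut-down, made-up version of test_data, and year is stored as a factor so that every group gets a column for both years even when one of them has no cases.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
test_data <- data.frame(
  var1 = rnorm(100, 1),
  var2 = rnorm(100, 2),
  group = sample(letters[1:10], 100, replace = TRUE),
  year = factor(sample(c(2007, 2009), 100, replace = TRUE))
)
vars <- c("var1", "var2")

# One table per group: rows are variables, columns are years
tables <- lapply(split(test_data, test_data$group), function(g) {
  sapply(split(g[vars], g$year), colMeans)
})

# The result is a plain list, so further processing is another lapply away
transposed <- lapply(tables, t)
tables[[1]]
```

A year with no cases in a group shows up as a column of NaN instead of disappearing, which keeps all the tables the same shape.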
First of all, you don't need cbind; using it is why everything ends up as a factor. This works:
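A short sketch of the difference, assuming nothing beyond base R (the seed and the small matrix m are only there for illustration). cbind() builds a matrix first, and a matrix can hold only one type, so mixing numbers with letters turns everything into character; data.frame() then sees text columns, which older versions of R converted to factors by default.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# cbind() coerces the numeric column to character to fit a single matrix type
m <- cbind(var1 = rnorm(5, 1), group = sample(letters[1:3], 5, replace = TRUE))
typeof(m)

# Passing the columns straight to data.frame() keeps each column's own type
test_data <- data.frame(
  var0 = rnorm(100),
  var1 = rnorm(100, 1),
  var2 = rnorm(100, 2),
  var3 = rnorm(100, 3),
  var4 = rnorm(100, 4),
  group = sample(letters[1:10], 100, replace = TRUE),
  year = sample(c(2007, 2009), 100, replace = TRUE)
)
str(test_data)
```

With the columns numeric from the start, the as.numeric(as.character(...)) conversions in the question are no longer needed.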
Secondly, the best practice is to use "." instead of "_" in variable names. See the Google style guide (for instance).
Finally, you can use the Rigroup package; it's very fast. Combine the igroupMeans() function with apply, and set the index i = as.factor(paste(test_data$group, test_data$year, sep = "")). I'll try to include an example of this later.
EDIT 6/9/2017: The Rigroup package was removed from CRAN. See this
First do a simple aggregate to get it summarized.
That makes a data.frame like this...
That, by itself, is pretty close to what you wanted. You could just break it up by group now.
OK, so that's not quite it but we can refine the output if you really want to.
That doesn't have all your table formatting but it's organized exactly as you describe and is darn close. This last step you could pretty up how you like.
This is the only answer here that matches the requested organization, and it's the fastest way to do it in R. BTW, I wouldn't bother doing that last step and just stick with the very first output from the aggregate... or maybe the split.
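The three steps described above might look roughly like this. It is a sketch under stated assumptions: the data frame is a cut-down, made-up version of test_data, and the formula interface of aggregate() is used, which may differ from the call in the original answer.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
test_data <- data.frame(
  var1 = rnorm(100, 1),
  var2 = rnorm(100, 2),
  group = sample(letters[1:10], 100, replace = TRUE),
  year = sample(c(2007, 2009), 100, replace = TRUE)
)

# Step 1: one row of means per group/year combination
agg <- aggregate(cbind(var1, var2) ~ group + year, data = test_data, FUN = mean)

# Step 2: break the summary up by group
by_group <- split(agg, agg$group)

# Step 3: per group, keep the variables and transpose so years become columns
tables <- lapply(by_group, function(g) {
  m <- t(g[, c("var1", "var2")])
  colnames(m) <- g$year
  m
})
tables[[1]]
```

Here cbind() is safe because it combines only numeric columns inside the formula; agg alone is often enough, and the last two steps only rearrange it.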