Summary statistics using ddply
I would like to write a function using ddply that outputs summary statistics based on the names of two columns of the data.frame mat.
mat is a big data.frame with the columns "metric", "length", "species", "tree", ..., "index".
index is a factor with 2 levels, "Short" and "Long".
"metric", "length", "species", "tree" and the others are all continuous variables.
Function:
summary1 <- function(arg1,arg2) {
  ...
  ss <- ddply(mat, .(index), function(X) data.frame(
    arg1 = as.list(summary(X$arg1)),
    arg2 = as.list(summary(X$arg2)),
    .parallel = FALSE)
  ss
}
I expect the output of summary1("metric","length") to look like this:
Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.
....
Long metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.
....
At the moment the function does not produce the desired output. What modification should be made here?
Thanks for your help.
Here is a toy example:
mat <- data.frame(
  metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
  tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)
As Nick wrote in his answer, you can't use $ to reference a variable passed as a character name. When you write X$arg1, R searches for a column literally named "arg1" in the data.frame X. You can reference the column with either X[,arg1] or X[[arg1]]. If you want nicely named output, I propose the solution below:
Output for the toy data is:
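A minimal sketch of the approach described above, not necessarily the exact code from this answer: the columns are looked up with X[[arg1]] and X[[arg2]], and the names of each summary are prefixed with the column name so the result has columns such as metric.Min. and length.Median. The data argument and the set.seed() call are assumptions added here to keep the example self-contained and reproducible.

```r
library(plyr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible
mat <- data.frame(
  metric = rpois(10, 10), length = rpois(10, 10), species = rpois(10, 10),
  tree = rpois(10, 10), index = c(rep("Short", 5), rep("Long", 5))
)

# `$` looks for a column literally called "arg1", which does not exist
is.null(mat$arg1)

# Hypothetical helper: summary of one column as a named numeric vector,
# with names prefixed by the column name, e.g. "metric.Min."
named_summary <- function(X, col) {
  s <- summary(X[[col]])  # `[[` accepts the column name as a string
  setNames(as.numeric(s), paste(col, names(s), sep = "."))
}

# `data` is an added argument (an assumption); it defaults to the global mat
summary1 <- function(arg1, arg2, data = mat) {
  ddply(data, .(index), function(X) {
    # one-row data.frame per group; data.frame() turns "1st Qu." into "1st.Qu."
    data.frame(as.list(c(named_summary(X, arg1), named_summary(X, arg2))))
  })
}

res <- summary1("metric", "length")
print(res)
```

With this sketch, res has one row per level of index and one column per summary statistic of each requested variable. The values depend on the random toy data.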
Is this more like what you want?
As ddply is long outdated now, skimr is a quick way to get grouped summary statistics:
The summary statistics shown, such as the median, can be customized with sfl() passed into the skim_with factory. The resulting summary is in tall form, with one row per variable and level of the grouping variable index. This is easier to work with than many summary columns in a wide format. You can also get the summary as a data frame instead of the printed text summary.
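A rough sketch of how this might look, assuming the skimr and dplyr packages are installed. The name my_skim and the choice of statistics (median and IQR) are illustrative assumptions, not taken from the original answer.

```r
library(dplyr)
library(skimr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible
mat <- data.frame(
  metric = rpois(10, 10), length = rpois(10, 10), species = rpois(10, 10),
  tree = rpois(10, 10), index = c(rep("Short", 5), rep("Long", 5))
)

# Custom skimmer: replace the default numeric statistics with median and IQR
my_skim <- skim_with(numeric = sfl(median = median, iqr = IQR), append = FALSE)

# Grouped summary in tall form: one row per variable and group
res <- mat %>%
  group_by(index) %>%
  my_skim(metric, length)

print(res)

# Plain data frame instead of the printed text summary
res_df <- as.data.frame(res)
```

Each custom statistic appears as a column prefixed with the column type, for example numeric.median.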