Summary statistics using ddply

Posted on 2024-11-01 22:50:07

I would like to write a function using ddply that outputs summary statistics for two columns of the data.frame mat, given the column names.

mat is a big data.frame with the columns "metric", "length", "species", "tree", ..., "index".
index is a factor with 2 levels, "Short" and "Long".
"metric", "length", "species", "tree" and the others are all continuous variables.

Function:

summary1 <- function(arg1,arg2) {
    ...

    ss <- ddply(mat, .(index), function(X) data.frame(
        arg1 = as.list(summary(X$arg1)),
        arg2 = as.list(summary(X$arg2)),
        .parallel = FALSE)

    ss
}

I expect the output to look like this after calling summary1("metric","length"):

Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

Long metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

At the moment the function does not produce the desired output. What modification should be made here?

Thanks for your help.

Here is a toy example:

mat <- data.frame(
    metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
    tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)

鸢与 2024-11-08 22:50:07

As Nick wrote in his answer, you can't use $ to reference a column whose name is passed as a character string. When you write X$arg1, R searches for a column literally named "arg1" in the data.frame X. You can reference the intended column with either X[,arg1] or X[[arg1]].
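
To make the difference concrete, here is a minimal base-R sketch. The data frame X and its values are made up for illustration; only the column names come from the question.

```r
# Toy data frame; the column names mirror the question, the values are arbitrary.
X <- data.frame(metric = c(5, 7, 10, 10, 11), length = c(9, 10, 11, 12, 12))
arg1 <- "metric"  # column name held in a variable, as inside summary1()

by_dollar  <- X$arg1      # looks for a column literally called "arg1"
by_double  <- X[[arg1]]   # looks up the column named by the value of arg1
by_bracket <- X[, arg1]   # same column via matrix-style indexing

print(is.null(by_dollar))                 # no column is called "arg1"
print(identical(by_double, X$metric))     # [[ ]] found the metric column
print(identical(by_bracket, by_double))   # both indexing forms agree
```

Because X$arg1 finds no such column, summary() in the original function never sees the data it was meant to summarize.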

If you want nicely named output, I propose the solution below:

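library(plyr)  # provides ddply() and .()

summary1 <- function(arg1, arg2) {
    # X[[arg1]] looks up the column whose name is stored in arg1;
    # setNames() labels each summary so columns come out as "<name>.<stat>".
    ss <- ddply(mat, .(index), function(X) data.frame(
        setNames(
            list(as.list(summary(X[[arg1]])), as.list(summary(X[[arg2]]))),
            c(arg1, arg2)
        )), .parallel = FALSE)

    ss
}
summary1("metric", "length")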

Output for toy data is:

  index metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu.
1  Long           5              7            10         8.6             10
2 Short           7              7             9         8.8             10
  metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu.
1          11           9             10            11        10.8             12
2          11           4              9             9         9.0             11
  length.Max.
1          12
2          12
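
The prefixed column names come from how data.frame() flattens a named list of lists. The sketch below isolates that step for a single column, without plyr; the vector x is made-up data and row is a name chosen here for illustration.

```r
x <- c(5, 7, 10, 10, 11)   # arbitrary values standing in for one group's column
arg1 <- "metric"           # column name passed as a string

# A list holding one named element ("metric"), which is itself the list of
# summary statistics. data.frame() expands it to one column per statistic
# and prefixes each with the outer name.
row <- data.frame(setNames(list(as.list(summary(x))), arg1))

print(names(row))   # prefixed names such as metric.Min., as in the output above
print(nrow(row))    # a single row of statistics
```

Inside ddply, one such row is built per level of index and the rows are stacked, which gives the wide table shown above.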

掩耳倾听 2024-11-08 22:50:07

Is this more like what you want?

summary1 <- function(arg1, arg2) {
    # .parallel is an argument of ddply(), so it goes outside data.frame()
    ss <- ddply(mat, .(index), function(X) {
        data.frame(
            arg1 = as.list(summary(X[, arg1])),
            arg2 = as.list(summary(X[, arg2]))
        )
    }, .parallel = FALSE)
    ss
}

故笙诉离歌 2024-11-08 22:50:07

Since ddply is long outdated, skimr is a quick way to get grouped summary statistics:

> library(dplyr)   # for %>% and group_by()
> library(skimr)   # for skim_with() and sfl()
> my_skim <- skim_with(numeric = sfl(median))
> mat %>% group_by(index) %>% my_skim()
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             10        
Number of columns          5         
_______________________              
Column type frequency:               
  numeric                  4         
________________________             
Group variables            index     

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
  skim_variable index n_missing complete_rate mean   sd p0 p25 p50 p75 p100 hist  median
1 metric        Long          0             1 10.2 3.70  5   8  11  13   14 ▃▃▁▃▇     11
2 metric        Short         0             1 10.6 3.21  6  10  11  11   15 ▂▁▇▁▂     11
3 length        Long          0             1  9.8 2.05  8   8  10  10   13 ▇▇▁▁▃     10
4 length        Short         0             1  8.6 1.34  7   8   8  10   10 ▃▇▁▁▇      8
5 species       Long          0             1  8.8 4.09  4   7   8  10   15 ▃▇▃▁▃      8
6 species       Short         0             1 11.4 3.36  7   9  12  14   15 ▃▃▁▃▇     12
7 tree          Long          0             1  8.8 3.83  6   6   7  10   15 ▇▁▂▁▂      7
8 tree          Short         0             1  9   2.55  6   8   9   9   13 ▃▃▇▁▃      9

The statistics shown, such as the median, can be customized by passing an sfl() specification to the skim_with() factory.

The resulting summary is in tall form, with one row per variable and per level of the grouping variable index. This is generally easier to work with than many summary columns in a wide format. You can also get the summary as a data frame instead of the printed text summary.
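
As a rough illustration of the tall layout that does not depend on skimr, here is a base-R sketch. It is not skimr's implementation; the data values and the names vars and tall are assumptions made for this example.

```r
# Toy data with fixed values so the result is reproducible.
mat <- data.frame(
    metric = c(5, 7, 10, 10, 11, 7, 7, 9, 10, 11),
    length = c(9, 10, 11, 12, 12, 4, 9, 9, 11, 12),
    index  = rep(c("Long", "Short"), each = 5)
)

vars <- c("metric", "length")

# Build one row per (variable, group) pair and stack the pieces.
tall <- do.call(rbind, lapply(vars, function(v) {
    groups <- split(mat[[v]], mat$index)   # values of column v, by group
    data.frame(
        variable  = v,
        index     = names(groups),
        mean      = vapply(groups, mean, numeric(1)),
        median    = vapply(groups, median, numeric(1)),
        row.names = NULL
    )
}))

print(tall)

# Picking out one statistic is a plain row filter, whatever the variable.
print(tall[tall$variable == "metric" & tall$index == "Long", "mean"])
```

In the tall layout, adding another variable adds rows rather than columns, so downstream filtering code stays the same.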
