Adding a higher-level grouping to a summary statistics table

Posted 2025-01-12 18:34:08

A while ago I asked how to make a grouped summary table: How can I group columns of descriptive statistics in R?

I'd like to do something similar to this, but a few steps further, and I'm not sure how to proceed.

Here's what I have so far:

data %>%
  dplyr::filter_all(all_vars(!is.na(.))) %>%
  group_by(Type.Time, Type.Perc, Grp) %>%
  dplyr::summarise(mean.ms = sprintf("%.2f", mean(Time, na.rm = TRUE)),
                   se.ms = sprintf("%.2f", (sd(Time, na.rm = T))/sqrt(data %>% filter(Grp == 1) %>% nrow())),
                   mean.perc = sprintf("%.2f", mean(Percentage, na.rm = TRUE)),
                   se.perc = sprintf("%.2f", (sd(Percentage, na.rm = T))/sqrt(data %>% filter(Grp == 1) %>% nrow())),
                   ) %>%
  gather(key, value, mean.ms:se.perc) %>%
  unite(Group, Grp, key) %>%
  spread(Group, value)

This gives me the information I want, but in the wrong format and with twice as many values as I need:

| Type.Time | Type.Perc | 1_mean.ms | 1_mean.perc | 1_se.ms | 1_se.perc | 2_mean.ms | 2_mean.perc | 2_se.ms | 2_se.perc|
|-----------|-----------|-----------|-------------|---------|-----------|-----------|-------------|---------|----------|
| TType2    | PType2    | 703       | 15          | 15      | 1.4       | 573       | 8           | 22      | 1.3      |       
| TType2    | PType1    | 703       | 10          | 15      | 1.8       | 573       | 13          | 22      | 3.1      |
| TType1    | PType2    | 710       | 15          | 18      | 1.4       | 622       | 8           | 29      | 1.3      |
| TType1    | PType1    | 710       | 10          | 18      | 1.8       | 622       | 13          | 29      | 3.1      |

I'd like the top-level grouping in the new table to be Grp (1 or 2), which currently appears as the prefix before 'mean'/'se' in the column names. Beneath it should be subgroups for Type1 and Type2, with the T and P prefixes instead becoming the rows (ms and % respectively). So my aim is to produce a table in this format:

     |         Group1         |          Group2           |
     |------------------------|---------------------------|
     |    Type1   |   Type2   |    Type1   |    Type2     |
     |------------|-----------|------------|--------------|
     |   M |  SE  |  M  | SE  |   M  | SE  |   M  |  SE   |
|----|-----|------|-----|-----|------|-----|------|-------|
|ms  | [values calculated from 'Time' variable]           |
|%   | [values calculated from 'Percentage' variable]     |

I hope that makes sense!

Example data:

structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), Grp = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Type.Time = c("TType1", 
"TType1", "TType2", "TType2", "TType1", "TType1", "TType2", "TType2", 
"TType1", "TType1", "TType2", "TType2", "TType1", "TType1", 
"TType2", "TType2"), Time = c(711, 711, 669, 669, 765, 765, 876, 876, 740, 
740, 658, 658, 456, 456, 423, 423), Type.Perc = c("PType1", 
"PType2", "PType1", "PType2", "PType1", "PType2", 
"PType1", "PType2", "PType1", "PType2", "PType1", 
"PType2", "PType1", "PType2", "PType1", "PType2"
), Percentage = c(8, 3, 9, 7, 19, 22, 30, 21, 10, 5, 10, 5, 8, 7, 
13, 5)), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))


Comments (1)

本王不退位尔等都是臣 2025-01-19 18:34:08

One option to configure such header groupings is with the kableExtra package.

For the data preparation, I've made two main changes: I only consider rows where Type.Time and Type.Perc refer to the same type (to avoid the excess combinations shown in the question), and I calculate the SE values per Type and Group (the example code mixes different groupings here, which I assume is not intended).

library(tidyverse)
df <- data %>%
  dplyr::filter_all(all_vars(!is.na(.))) %>%
  dplyr::mutate(
    Type = stringr::str_extract(Type.Time, "Type[0-9]"),
    Type.Perc = stringr::str_extract(Type.Perc, "Type[0-9]"),
  ) %>%
  dplyr::filter(Type == Type.Perc) %>%
  dplyr::select(-Type.Perc, -Type.Time, -ID) %>%
  pivot_longer(c(Percentage, Time), names_to = "parameter") %>%
  group_by(Type, Grp, parameter) %>%
  dplyr::summarise(
    mean = sprintf("%.2f", mean(value, na.rm = TRUE)),
    se = sprintf("%.2f", (sd(value, na.rm = TRUE))/sqrt(n())),
    .groups = "drop"
  ) %>%
  tidyr::pivot_longer(c(mean, se)) %>%
  arrange(Grp, Type) %>%
  tidyr::pivot_wider(id_cols = "parameter", names_from = c("Grp", "Type", "name"))

# A tibble: 2 x 9
  parameter  `1_Type1_mean` `1_Type1_se` `1_Type2_mean` `1_Type2_se` `2_Type1_mean`
  <chr>      <chr>          <chr>        <chr>          <chr>        <chr>         
1 Percentage 13.50          5.50         14.00          7.00         9.00          
2 Time       738.00         27.00        772.50         103.50       598.00        
# ... with 3 more variables: `2_Type1_se` <chr>, `2_Type2_mean` <chr>,
#   `2_Type2_se` <chr>
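
To illustrate why the SE denominator matters, here is a minimal base R sketch for a single cell (Grp 1, Type1, Time). The two Time values and the count of 8 Grp 1 rows are read off the example data in the question; the variable names are only illustrative.

```r
# Time values for Grp 1 / Type1 in the example data (IDs 1 and 2)
time_g1_t1 <- c(711, 765)

# Question's approach: the denominator is the number of rows with Grp == 1
# in the whole data set (8 in the example data), regardless of the cell
n_grp1_rows <- 8
se_question <- sd(time_g1_t1) / sqrt(n_grp1_rows)

# Approach used here: the denominator is the size of the current group,
# which is what n() returns inside summarise()
se_answer <- sd(time_g1_t1) / sqrt(length(time_g1_t1))

sprintf("%.2f", c(se_question, se_answer))
# "13.50" "27.00"
```

The second value matches the `1_Type1_se` entry for Time in the output above.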

The values are already in the right format, so we can simply define the header groupings with add_header_above, calling it once per header level. kableExtra also provides plenty of additional options for modifying the output format.

library(kableExtra)

kable(df, col.names = c("", "M", "SE", "M", "SE", "M", "SE", "M", "SE"),
      align = c("l", "r", "r", "r", "r", "r", "r", "r", "r"),
      format = "html") %>%
  kable_styling() %>%
  add_header_above(c(" ", "Type1" = 2, "Type2" = 2, "Type1" = 2, "Type2" = 2)) %>%
  add_header_above(c(" ", "Group1" = 4, "Group2" = 4))
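
If the number of groups or types may change, the header spans could be derived from the column names instead of being typed by hand. The following is only a sketch in base R, assuming the columns keep the `Grp_Type_stat` naming produced by pivot_wider above and stay sorted by Grp and Type; `cols`, `group_header` and `type_header` are illustrative names, and in practice `cols` would be `names(df)[-1]`.

```r
# Column names as produced by the pivot_wider() step (without "parameter")
cols <- c("1_Type1_mean", "1_Type1_se", "1_Type2_mean", "1_Type2_se",
          "2_Type1_mean", "2_Type1_se", "2_Type2_mean", "2_Type2_se")

# Split each name into its Grp / Type / statistic parts
parts <- strsplit(cols, "_", fixed = TRUE)
grp   <- paste0("Group", vapply(parts, `[`, character(1), 1))
type  <- vapply(parts, `[`, character(1), 2)

# Count consecutive columns belonging to the same group
grp_runs <- rle(grp)
group_header <- c(" " = 1, setNames(grp_runs$lengths, grp_runs$values))

# Count consecutive columns with the same group AND type, so that equal
# types in neighbouring groups are never merged into one span
type_runs   <- rle(paste(grp, type))
type_labels <- type[cumsum(type_runs$lengths)]
type_header <- c(" " = 1, setNames(type_runs$lengths, type_labels))

group_header
type_header
```

These named vectors have the same shape as the hand-written ones, so they can be passed as `add_header_above(type_header)` followed by `add_header_above(group_header)`.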

