Is there a way to pivot_longer to multiple values columns in R?

Posted 2025-01-15 09:23:59

I'm trying to use pivot_longer() to lengthen my data frame, but I don't need it to be fully long; I would like the output to have multiple "values" columns.

Example:

df <- tibble(
  ids = c("protein1", "protein2"),
  mean.group1 = sample(1:1000, 2),
  mean.group2 = sample(1:1000, 2),
  se.group1 = sample(1:10, 2),
  se.group2 = sample(1:10, 2)
)

df
# A tibble: 2 × 5
  ids      mean.group1 mean.group2 se.group1 se.group2
  <chr>          <int>       <int>     <int>     <int>
1 protein1         763         456         6         4
2 protein2         820         624         4         7

My desired output is:

df2 <- tibble(
  ids = c("protein1", "protein1", "protein2", "protein2"),
  mean = c(df$mean.group1[1], df$mean.group2[1], df$mean.group1[2], df$mean.group2[2]),
  se = c(df$se.group1[1], df$se.group2[1], df$se.group1[2], df$se.group2[2]),
  group = c("group1", "group2", "group1", "group2")
)

df2

# A tibble: 4 × 4
  ids       mean    se group 
  <chr>    <int> <int> <chr> 
1 protein1   763     6 group1
2 protein1   456     4 group2
3 protein2   820     4 group1
4 protein2   624     7 group2

So far, I have tried several successive pivot_longer() calls, each followed by unique(), but this messes up the output:

df_longer <- df %>%
  pivot_longer(cols = starts_with("mean."),
               names_to = "group",
               names_prefix = "mean.",
               values_to = "mean") %>%
  unique() %>%
  pivot_longer(cols = starts_with("se."),
               names_to = "group",
               names_prefix = "se.",
               values_to = "se",
               names_repair = "unique") %>%
  unique()

df_longer

# A tibble: 8 × 5
  ids      group...2  mean group...4    se
  <chr>    <chr>     <int> <chr>     <int>
1 protein1 group1      763 group1        6
2 protein1 group1      763 group2        4
3 protein1 group2      456 group1        6
4 protein1 group2      456 group2        4
5 protein2 group1      820 group1        4
6 protein2 group1      820 group2        7
7 protein2 group2      624 group1        4
8 protein2 group2      624 group2        7

I sort of understand why: each pivot duplicates the rows again, so the group of the mean and the group of the se are no longer tied together within a row. However, I'm having trouble coming up with a solution. I'm aware that there's a names_pattern option, but I'm not sure how it would apply in this case.

Any help would be much appreciated! I've considered pivoting to fully long format (i.e. a "measurement" column holding 'mean', 'se', etc.) and then using pivot_wider() to get the format I need, but I haven't been able to figure out how to do that either. Let me know if any more information is needed. My actual dataset has 4 different measurements (same format, i.e. measurement.group) and thousands of proteins, but the principle should be the same, I hope!

生生漫 2025-01-22 09:23:59

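We don't need multiple calls if we pass a vector to names_to. The special ".value" entry tells pivot_longer() that the matching part of each column name (mean, se) becomes the name of an output value column, while "group" receives the remaining suffix of the column name. Here we use names_sep = "\\." to split each column name at the literal dot.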

library(tidyr)

# ".value" keeps mean/se as separate value columns; the suffix goes to "group"
pivot_longer(df,
             cols = -ids,
             names_to = c(".value", "group"),
             names_sep = "\\.")

Output:

# A tibble: 4 × 4
  ids      group   mean    se
  <chr>    <chr>  <int> <int>
1 protein1 group1   982     3
2 protein1 group2   657     7
3 protein2 group1   663     9
4 protein2 group2   215     1

NOTE: the values differ from those in the question because the input data was created with sample() and no set.seed() was specified.
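The question also asks about names_pattern and about going through a fully long table followed by pivot_wider(). As a rough sketch of how those two routes might look (not part of the original answer), the block below rebuilds df with the fixed values printed in the question so the result is reproducible. The regular expression and the object names by_pattern, fully_long and by_two_steps are illustrative assumptions.

```r
library(tidyr)  # also re-exports tibble()

# Fixed values copied from the tibble printed in the question (assumption:
# used here instead of sample() so the result is reproducible)
df <- tibble(
  ids = c("protein1", "protein2"),
  mean.group1 = c(763L, 820L),
  mean.group2 = c(456L, 624L),
  se.group1 = c(6L, 4L),
  se.group2 = c(4L, 7L)
)

# Route 1: names_pattern with two capture groups.
# Group 1 (text before the first dot) feeds ".value", group 2 feeds "group".
by_pattern <- pivot_longer(
  df,
  cols = -ids,
  names_to = c(".value", "group"),
  names_pattern = "^([^.]+)\\.(.+)$"
)

# Route 2: pivot to fully long first, then spread the measurement names
# back out into one column per measurement.
fully_long <- pivot_longer(
  df,
  cols = -ids,
  names_to = c("measurement", "group"),
  names_sep = "\\.",
  values_to = "value"
)
by_two_steps <- pivot_wider(
  fully_long,
  names_from = measurement,
  values_from = value
)

by_pattern
by_two_steps
```

Both results should have one row per protein and group, with mean and se as separate columns, which matches the desired output in the question apart from column order. The single-call form is shorter; the two-step form may be easier to follow when the intermediate long table is useful in its own right.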
