通过日期选择重复项

发布于 2025-02-03 03:11:27 字数 210 浏览 1 评论 0原文

嗨,我只需要帮助即可从示例数据创建所需的输出。

任何帮助将不胜感激。

enter image description here

Hi, I just need help to create desired output from the sample data.

Any help is greatly appreciated.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

冷月断魂刀 2025-02-10 03:11:27

正如评论中已经提到的那样,这在一定程度上“违反”了整洁数据的前提,尽管如果您在报告生成时间进行此操作,这不是一个问题(其中更广泛的布局可能更加美观)。

尝试以下操作:

library(dplyr)
# library(tidyr) # pivot_wider
out <- dat %>%
  # set up for determining groups of columns
  group_by(ID) %>%
  mutate(
    n = row_number(),
    IDn = paste(n, ID, sep = "_")
  ) %>%
  ungroup() %>%
  # pivot so that each group of columns is now in a wider format
  tidyr::pivot_wider(ID, names_from = "n", values_from = c("IDn", "Date", "Score")) %>%
  # convert from the solitary 'ID' field to the duplicated 'IDn' fields,
  # and fix the augmented ID _values_ to remove the 'n' component
  select(-ID) %>%
  mutate(across(starts_with("IDn"), ~ sub(".*_", "", .))) %>%
  # order them based on the trailing number, natural sort after that
  select(colnames(.)[ order(sub(".*_", "", colnames(.))) ])
out
# # A tibble: 10 x 9
#    IDn_1 Date_1 Score_1 IDn_2 Date_2 Score_2 IDn_3 Date_3 Score_3
#    <chr> <chr>  <chr>   <chr> <chr>  <chr>   <chr> <chr>  <chr>  
#  1 101   44656  10      101   44660  50      101   44676  65     
#  2 102   44656  30      102   44678  66      NA    NA     NA     
#  3 103   44658  40      103   44673  32      NA    NA     NA     
#  4 104   44659  20      104   44682  65      NA    NA     NA     
#  5 105   44661  35      NA    NA     NA      NA    NA     NA     
#  6 106   44661  45      NA    NA     NA      NA    NA     NA     
#  7 107   44682  51      NA    NA     NA      NA    NA     NA     
#  8 108   44683  25      NA    NA     NA      NA    NA     NA     
#  9 109   44684  39      NA    NA     NA      NA    NA     NA     
# 10 110   44685  43      NA    NA     NA      NA    NA     NA     

结果有两件事“错误”:

  • id列中有一个n,主要是因为我选择了如何通过每个组重复它们列。

  • 所有列都有_#附加。通常,R更喜欢其列名是唯一的。它可以被强制强制,但是从该点开始对列的任何引用(如果使用列名称)或需要整数列索引(可能是脆弱的)。

  • 然而,dplyr和大多数tidyverse一般而言,更偏爱重复名称,以至于我找不到优雅/通用选择(..)一个步骤。

您必须像问题中的名称一样

names(out) <- sub("n?_.*", "", names(out))
out
# # A tibble: 10 x 9
#    ID    Date  Score ID    Date  Score ID    Date  Score
#    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#  1 101   44656 10    101   44660 50    101   44676 65   
#  2 102   44656 30    102   44678 66    NA    NA    NA   
#  3 103   44658 40    103   44673 32    NA    NA    NA   
#  4 104   44659 20    104   44682 65    NA    NA    NA   
#  5 105   44661 35    NA    NA    NA    NA    NA    NA   
#  6 106   44661 45    NA    NA    NA    NA    NA    NA   
#  7 107   44682 51    NA    NA    NA    NA    NA    NA   
#  8 108   44683 25    NA    NA    NA    NA    NA    NA   
#  9 109   44684 39    NA    NA    NA    NA    NA    NA   
# 10 110   44685 43    NA    NA    NA    NA    NA    NA   

如果 您很多。)


数据(来自OP评论)

dat <- structure(list(ID = c("101", "102", "103", "104", "101", "105", "106", "103", "101", "102", "104", "107", "108", "109", "110" ), Date = c("44656", "44656", "44658", "44659", "44660", "44661", "44661", "44673", "44676", "44678", "44682", "44682", "44683", "44684", "44685"), Score = c("10", "30", "40", "20", "50", "35", "45", "32", "65", "66", "65", "51", "25", "39", "43")), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))

As already mentioned in comments, this "violates" the premise of tidy data to a certain extent, though it is less a problem if you are doing this at report-generation time (where wider layouts may be more aesthetically functional).

Try this:

library(dplyr)
# library(tidyr) # pivot_wider
out <- dat %>%
  # set up for determining groups of columns
  group_by(ID) %>%
  mutate(
    n = row_number(),
    IDn = paste(n, ID, sep = "_")
  ) %>%
  ungroup() %>%
  # pivot so that each group of columns is now in a wider format
  tidyr::pivot_wider(ID, names_from = "n", values_from = c("IDn", "Date", "Score")) %>%
  # convert from the solitary 'ID' field to the duplicated 'IDn' fields,
  # and fix the augmented ID _values_ to remove the 'n' component
  select(-ID) %>%
  mutate(across(starts_with("IDn"), ~ sub(".*_", "", .))) %>%
  # order them based on the trailing number, natural sort after that
  select(colnames(.)[ order(sub(".*_", "", colnames(.))) ])
out
# # A tibble: 10 x 9
#    IDn_1 Date_1 Score_1 IDn_2 Date_2 Score_2 IDn_3 Date_3 Score_3
#    <chr> <chr>  <chr>   <chr> <chr>  <chr>   <chr> <chr>  <chr>  
#  1 101   44656  10      101   44660  50      101   44676  65     
#  2 102   44656  30      102   44678  66      NA    NA     NA     
#  3 103   44658  40      103   44673  32      NA    NA     NA     
#  4 104   44659  20      104   44682  65      NA    NA     NA     
#  5 105   44661  35      NA    NA     NA      NA    NA     NA     
#  6 106   44661  45      NA    NA     NA      NA    NA     NA     
#  7 107   44682  51      NA    NA     NA      NA    NA     NA     
#  8 108   44683  25      NA    NA     NA      NA    NA     NA     
#  9 109   44684  39      NA    NA     NA      NA    NA     NA     
# 10 110   44685  43      NA    NA     NA      NA    NA     NA     

The results have two things "wrong":

  • The ID columns have an n in there, mostly because of how I chose to duplicate them across each group of columns.

  • All columns have _# appended. R in general prefers its column names to be unique. It can be coerced, but it makes any references to columns from that point on either ambiguous (if using column names) or requiring integer column indices (which can be fragile).

  • However, dplyr and most of the tidyverse in general more-strongly prefer against duplicate names, so much so that I couldn't find an elegant/generic select(..) to do it in one step.

If you must have the names as in your question, it can be done outside of dplyr with:

names(out) <- sub("n?_.*", "", names(out))
out
# # A tibble: 10 x 9
#    ID    Date  Score ID    Date  Score ID    Date  Score
#    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#  1 101   44656 10    101   44660 50    101   44676 65   
#  2 102   44656 30    102   44678 66    NA    NA    NA   
#  3 103   44658 40    103   44673 32    NA    NA    NA   
#  4 104   44659 20    104   44682 65    NA    NA    NA   
#  5 105   44661 35    NA    NA    NA    NA    NA    NA   
#  6 106   44661 45    NA    NA    NA    NA    NA    NA   
#  7 107   44682 51    NA    NA    NA    NA    NA    NA   
#  8 108   44683 25    NA    NA    NA    NA    NA    NA   
#  9 109   44684 39    NA    NA    NA    NA    NA    NA   
# 10 110   44685 43    NA    NA    NA    NA    NA    NA   

(Sure, it's possible to put that somehow in a dplyr pipe, but I don't know if that buys you much.)


Data (from the OP comments)

dat <- structure(list(ID = c("101", "102", "103", "104", "101", "105", "106", "103", "101", "102", "104", "107", "108", "109", "110" ), Date = c("44656", "44656", "44658", "44659", "44660", "44661", "44661", "44673", "44676", "44678", "44682", "44682", "44683", "44684", "44685"), Score = c("10", "30", "40", "20", "50", "35", "45", "32", "65", "66", "65", "51", "25", "39", "43")), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文