r survival-analysis dplyr data-transform

种子发芽数据：将时间数据从短格式转换为长格式以进行生存分析

发布于 2025-01-21 00:30:50 字数 2552 浏览 5 评论 0原文

我正在使用生存分析评估幼苗的出现率，我想自动化将简短表格收集的数据转换为R中的长形式进行分析的过程。

这是收集到的数据格式和日期转换的一个示例：

prac.dat <- tribble(
  ~ID, ~ImbibtionStartDate, ~Survey1date, ~Survey1totalcounts, ~Survey2date, ~Survey2totalcounts,~Survey3date, ~Survey3totalcounts, ~Total_sown_seeds,
  #--/--------------------/-------------/--------------------/-------------/------------------/---------------/------------------/-----------------/
  "ID1", "3/22/2022 14:20","3/24/2022 16:45", 0, "3/25/2022 16:00", 8, "3/26/2022 13:00", 21, 25,
  "ID2", "3/22/2022 14:20","3/24/2022 16:45", 1, "3/25/2022 16:00", 4, "3/26/2022 13:00", 11, 25,
)

prac.dat <- prac.dat %>% 
  mutate(ImbibtionStartDate=as.POSIXct(ImbibtionStartDate, format="%m/%d/%Y %H:%M"),
         Survey1date=as.POSIXct(Survey1date, format="%m/%d/%Y %H:%M"),
         Survey2date=as.POSIXct(Survey2date, format="%m/%d/%Y %H:%M"),
         Survey3date=as.POSIXct(Survey3date, format="%m/%d/%Y %H:%M"))

在此数据中set，“ id” 是种子播种的锅的身份，“ Imbibtionstartdate” 是首先浇水的日期和时间， “ Survey1Date” （和其他调查日期列）是进行调查的日期和时间，以计算幼苗的紧急情况，“ Survey1TotAlcounts” [和其他调查列列]表示累积数字在该锅中出现在该锅中的幼苗和“ total_sown_seeds” 表示播种的种子总数。

我的目标是一个数据集，即1）为每个锅中的每个种子生成一行（锅识别由“ ID”列表示），2）指示种子是否出现（“ 1”）或不出现（“ 0） ”）在整个研究期间，以及3）计算每个种子出现所需的特定时间（通过首次发现幼苗的调查日期和时间之间的差异以及Imbibtion的开始日期和时间之间的差异）。

我希望最终的输出看起来像这样：

desired.output <- tribble(
  ~ID, ~Emg_Poa, ~time_to_emg,
  #Unique Id for each Seed/
  #whether that seed emerged ("1") or not ("0") by the final survey date/
  #days it took for that seed to emerge from imbibtion start to survey date/
  "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA,
  "ID2",1, 2.10, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94,
  "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",0, NA, "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,
  "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA, "ID2",0, NA, "ID2",0, NA, "ID2",0, NA
  )

迄今为止，我已经将这些转换从一个Excel到另一个excel进行，但是为了最大程度地减少错误和节省时间，我很好奇是否有人愿意提出建议在R中自动化此过程的一种方法，此任务超出了我当前的R数据框架生成功能。感谢您的时间，考虑和投入。

原文

I am evaluating seedling emergence rates using survival analysis and I would like to automate the process of converting the short form collected data into the long form for analysis in R.

Here is an example of the collected data format and the date conversion:

prac.dat <- tribble(
  ~ID, ~ImbibtionStartDate, ~Survey1date, ~Survey1totalcounts, ~Survey2date, ~Survey2totalcounts,~Survey3date, ~Survey3totalcounts, ~Total_sown_seeds,
  #--/--------------------/-------------/--------------------/-------------/------------------/---------------/------------------/-----------------/
  "ID1", "3/22/2022 14:20","3/24/2022 16:45", 0, "3/25/2022 16:00", 8, "3/26/2022 13:00", 21, 25,
  "ID2", "3/22/2022 14:20","3/24/2022 16:45", 1, "3/25/2022 16:00", 4, "3/26/2022 13:00", 11, 25,
)

prac.dat <- prac.dat %>% 
  mutate(ImbibtionStartDate=as.POSIXct(ImbibtionStartDate, format="%m/%d/%Y %H:%M"),
         Survey1date=as.POSIXct(Survey1date, format="%m/%d/%Y %H:%M"),
         Survey2date=as.POSIXct(Survey2date, format="%m/%d/%Y %H:%M"),
         Survey3date=as.POSIXct(Survey3date, format="%m/%d/%Y %H:%M"))

In this data set, "ID" is the identity of the pot where seeds were sown, "ImbibtionStartDate" is the date and time when seeds in the soil were first watered, "Survey1date" [and other survey date columns] are the date and time a survey was conducted to count total seedling emergents, "Survey1totalcounts" [and other survey count columns] indicate the cumulative number of seedlings that have emerged in that pot by that survey date, and "Total_sown_seeds" indicates the total number of seeds that were sown in a pot.

I aiming for a data set that 1) generates a row for every seed in every pot (pot identification is represented by the "ID" column), 2) indicates whether the seed emerged ("1") or did not emerge ("0") over the course of the study period, and 3) calculates the specific time it took for each seed to emerge (estimated by difference between the Survey date and time when the seedling was first spotted and the imbibtion start date and time).

I would like the final output to look something like this:

desired.output <- tribble(
  ~ID, ~Emg_Poa, ~time_to_emg,
  #Unique Id for each Seed/
  #whether that seed emerged ("1") or not ("0") by the final survey date/
  #days it took for that seed to emerge from imbibtion start to survey date/
  "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA,
  "ID2",1, 2.10, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94,
  "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",0, NA, "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,
  "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA, "ID2",0, NA, "ID2",0, NA, "ID2",0, NA
  )

To date, I've done these conversions by hand from one excel into another, but in the interest of minimizing errors and saving time, I am curious if anyone would be willing to propose a method of automating this process in R. This task is beyond my current functional capacity in R data frame generation. Thank you for your time, consideration, and input.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

风苍溪 2025-01-28 00:30:50

从prac.dat到您所需的输出有点棘手，但肯定有可能。首先，让我们将prac.dat进入“长”格式并计算一些有用的列：

prac.long <- prac.dat %>% 
  pivot_longer(matches('counts|Survey.*date'), names_to = c('survey_num', '.value'), names_pattern = 'Survey(\\d)(.*)') %>% 
  rename(survey_date = date, count = totalcounts) %>% 
  group_by(ID) %>% 
  mutate(
    across(c(ImbibtionStartDate, survey_date), ~as.POSIXct(., format="%m/%d/%Y %H:%M")),
    not_emerged = Total_sown_seeds - max(count),
    time_to_emerge = survey_date - ImbibtionStartDate,
    emerged_at_survey = count - lag(count),
    emerged_at_survey = ifelse(is.na(emerged_at_survey), count[1], emerged_at_survey)
  ) 

  ID    ImbibtionStartDate  Total_sown_seeds survey_num survey_date         count not_emerged
  <chr> <dttm>                         <dbl> <chr>      <dttm>              <dbl>       <dbl>
1 ID1   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     0           4
2 ID1   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     8           4
3 ID1   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    21           4
4 ID2   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     1          14
5 ID2   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     4          14
6 ID2   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    11          14
# … with 2 more variables: time_to_emerge <drtn>, emerged_at_survey <dbl>

我们还需要计算未出现的种子计数：

prac.unemerged <- select(prac.long, ID, not_emerged) %>% 
  distinct %>% 
  mutate(time_to_emerge = NA) %>% 
  rename(count = not_emerged)

  ID    count time_to_emerge
  <chr> <dbl> <lgl>         
1 ID1       4 NA            
2 ID2      14 NA

最后，我们将出现的种子及其它们的计数结合在一起是时候使用data.Unemerged发芽的时间，并使用uncount扩展到您所需的输出：

result <- select(prac.long, ID, time_to_emerge, count = emerged_at_survey) %>% 
  bind_rows(prac.unemerged) %>% 
  uncount(weights = count) %>% 
  mutate(Emg_poa = as.numeric(!is.na(time_to_emerge))) %>% 
  arrange(ID, time_to_emerge)

   ID    time_to_emerge Emg_poa
   <chr> <drtn>           <dbl>
 1 ID1   3.069444 days        1
 2 ID1   3.069444 days        1
 3 ID1   3.069444 days        1
 4 ID1   3.069444 days        1
 5 ID1   3.069444 days        1
 6 ID1   3.069444 days        1
 7 ID1   3.069444 days        1
 8 ID1   3.069444 days        1
 9 ID1   3.944444 days        1
10 ID1   3.944444 days        1

Getting from prac.dat to your desired output is a bit tricky, but certainly possible. First, let's get prac.dat into "long" format and calculate a few useful columns:

prac.long <- prac.dat %>% 
  pivot_longer(matches('counts|Survey.*date'), names_to = c('survey_num', '.value'), names_pattern = 'Survey(\\d)(.*)') %>% 
  rename(survey_date = date, count = totalcounts) %>% 
  group_by(ID) %>% 
  mutate(
    across(c(ImbibtionStartDate, survey_date), ~as.POSIXct(., format="%m/%d/%Y %H:%M")),
    not_emerged = Total_sown_seeds - max(count),
    time_to_emerge = survey_date - ImbibtionStartDate,
    emerged_at_survey = count - lag(count),
    emerged_at_survey = ifelse(is.na(emerged_at_survey), count[1], emerged_at_survey)
  ) 

  ID    ImbibtionStartDate  Total_sown_seeds survey_num survey_date         count not_emerged
  <chr> <dttm>                         <dbl> <chr>      <dttm>              <dbl>       <dbl>
1 ID1   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     0           4
2 ID1   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     8           4
3 ID1   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    21           4
4 ID2   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     1          14
5 ID2   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     4          14
6 ID2   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    11          14
# … with 2 more variables: time_to_emerge <drtn>, emerged_at_survey <dbl>

We also need to calculate counts of seeds that did not emerge:

prac.unemerged <- select(prac.long, ID, not_emerged) %>% 
  distinct %>% 
  mutate(time_to_emerge = NA) %>% 
  rename(count = not_emerged)

  ID    count time_to_emerge
  <chr> <dbl> <lgl>         
1 ID1       4 NA            
2 ID2      14 NA

Finally, we combine the counts of emerged seeds and their time to germination with the data.unemerged, and use uncount to expand to your desired output:

result <- select(prac.long, ID, time_to_emerge, count = emerged_at_survey) %>% 
  bind_rows(prac.unemerged) %>% 
  uncount(weights = count) %>% 
  mutate(Emg_poa = as.numeric(!is.na(time_to_emerge))) %>% 
  arrange(ID, time_to_emerge)

   ID    time_to_emerge Emg_poa
   <chr> <drtn>           <dbl>
 1 ID1   3.069444 days        1
 2 ID1   3.069444 days        1
 3 ID1   3.069444 days        1
 4 ID1   3.069444 days        1
 5 ID1   3.069444 days        1
 6 ID1   3.069444 days        1
 7 ID1   3.069444 days        1
 8 ID1   3.069444 days        1
 9 ID1   3.944444 days        1
10 ID1   3.944444 days        1

回复收藏 0 原文

~没有更多了~