Seed germination data: converting time-to-event data from short format to long format for survival analysis
I am evaluating seedling emergence rates using survival analysis and I would like to automate the process of converting the short form collected data into the long form for analysis in R.
Here is an example of the collected data format and the date conversion:
library(tidyverse)  # provides tribble(), mutate() and %>%

prac.dat <- tribble(
  ~ID, ~ImbibtionStartDate, ~Survey1date, ~Survey1totalcounts, ~Survey2date, ~Survey2totalcounts, ~Survey3date, ~Survey3totalcounts, ~Total_sown_seeds,
  #--/--------------------/-------------/--------------------/-------------/------------------/---------------/------------------/-----------------/
  "ID1", "3/22/2022 14:20", "3/24/2022 16:45", 0, "3/25/2022 16:00", 8, "3/26/2022 13:00", 21, 25,
  "ID2", "3/22/2022 14:20", "3/24/2022 16:45", 1, "3/25/2022 16:00", 4, "3/26/2022 13:00", 11, 25,
)

# Parse the date-time strings (month/day/year hour:minute) into POSIXct
prac.dat <- prac.dat %>%
  mutate(ImbibtionStartDate = as.POSIXct(ImbibtionStartDate, format = "%m/%d/%Y %H:%M"),
         Survey1date = as.POSIXct(Survey1date, format = "%m/%d/%Y %H:%M"),
         Survey2date = as.POSIXct(Survey2date, format = "%m/%d/%Y %H:%M"),
         Survey3date = as.POSIXct(Survey3date, format = "%m/%d/%Y %H:%M"))
In this data set, "ID" is the identity of the pot where seeds were sown, "ImbibtionStartDate" is the date and time when seeds in the soil were first watered, "Survey1date" [and other survey date columns] are the date and time a survey was conducted to count total seedling emergents, "Survey1totalcounts" [and other survey count columns] indicate the cumulative number of seedlings that have emerged in that pot by that survey date, and "Total_sown_seeds" indicates the total number of seeds that were sown in a pot.
I am aiming for a data set that 1) contains a row for every seed in every pot (the pot is identified by the "ID" column), 2) indicates whether the seed emerged ("1") or did not emerge ("0") over the course of the study period, and 3) gives the time it took each seed to emerge (estimated as the difference between the survey date and time at which the seedling was first spotted and the imbibition start date and time).
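To make the arithmetic in point 3 concrete, here is a small base-R sketch for pot ID1 only. The variable names (imbibition_start, survey_dates, cumulative_counts, and so on) are illustrative and not part of the original data, and tz = "UTC" is an assumption made so that elapsed time is not affected by the local time zone.

```r
# Illustrative values copied from the ID1 row of prac.dat
fmt <- "%m/%d/%Y %H:%M"
imbibition_start <- as.POSIXct("3/22/2022 14:20", format = fmt, tz = "UTC")
survey_dates <- as.POSIXct(
  c("3/24/2022 16:45", "3/25/2022 16:00", "3/26/2022 13:00"),
  format = fmt, tz = "UTC"
)
cumulative_counts <- c(0, 8, 21)

# Elapsed days from imbibition start to each survey
days_to_survey <- round(
  as.numeric(difftime(survey_dates, imbibition_start, units = "days")), 2
)
days_to_survey   # 2.10 3.07 3.94

# Seedlings first seen at each survey = change in the cumulative count
new_per_survey <- diff(c(0, cumulative_counts))
new_per_survey   # 0 8 13
```

With 25 seeds sown and 21 emerged by the last survey, the remaining 4 seeds in this pot are the ones recorded as not emerged.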
I would like the final output to look something like this:
desired.output <- tribble(
~ID, ~Emg_Poa, ~time_to_emg,
#Unique Id for each Seed/
#whether that seed emerged ("1") or not ("0") by the final survey date/
#days it took for that seed to emerge from imbibition start to survey date/
"ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07,
"ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94,
"ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA,
"ID2",1, 2.10, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94,
"ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",0, NA, "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,
"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA, "ID2",0, NA, "ID2",0, NA, "ID2",0, NA
)
To date, I've done these conversions by hand from one Excel file into another, but in the interest of minimizing errors and saving time, I am curious whether anyone would be willing to propose a way to automate this process in R. This task is beyond my current ability with R data frames. Thank you for your time, consideration, and input.
Comments (1)
Getting from prac.dat to your desired output is a bit tricky, but certainly possible. First, let's get prac.dat into "long" format and calculate a few useful columns.
We also need to calculate counts of seeds that did not emerge.
Finally, we combine the counts of emerged seeds and their time to germination with data.unemerged, and use uncount to expand to your desired output.
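One way these three steps might look in code is sketched below. It is a reconstruction under assumptions, not necessarily the answerer's exact code: the intermediate names data.long and result, the helper column n, and the use of tz = "UTC" are illustrative choices; only data.unemerged and uncount are named in the answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question; all four date columns are parsed in one go
# (ends_with() ignores case by default, so it also matches ImbibtionStartDate).
prac.dat <- tibble::tribble(
  ~ID, ~ImbibtionStartDate, ~Survey1date, ~Survey1totalcounts, ~Survey2date, ~Survey2totalcounts, ~Survey3date, ~Survey3totalcounts, ~Total_sown_seeds,
  "ID1", "3/22/2022 14:20", "3/24/2022 16:45", 0, "3/25/2022 16:00", 8, "3/26/2022 13:00", 21, 25,
  "ID2", "3/22/2022 14:20", "3/24/2022 16:45", 1, "3/25/2022 16:00", 4, "3/26/2022 13:00", 11, 25
) %>%
  mutate(across(ends_with("date"),
                ~ as.POSIXct(.x, format = "%m/%d/%Y %H:%M", tz = "UTC")))

# Step 1: one row per pot and survey, plus newly emerged seeds and elapsed days
data.long <- prac.dat %>%
  pivot_longer(
    cols = starts_with("Survey"),
    names_to = c("survey", ".value"),
    names_pattern = "Survey(\\d+)(date|totalcounts)"
  ) %>%
  group_by(ID) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # counts are cumulative, so new emergence is the change since the last survey
    n = totalcounts - lag(totalcounts, default = 0),
    time_to_emg = round(
      as.numeric(difftime(date, ImbibtionStartDate, units = "days")), 2
    )
  ) %>%
  ungroup()

# Step 2: seeds per pot that never emerged
data.unemerged <- data.long %>%
  group_by(ID) %>%
  summarise(n = first(Total_sown_seeds) - max(totalcounts), .groups = "drop") %>%
  mutate(Emg_Poa = 0, time_to_emg = NA_real_)

# Step 3: stack emerged and unemerged counts, then expand to one row per seed
result <- data.long %>%
  filter(n > 0) %>%
  mutate(Emg_Poa = 1) %>%
  select(ID, Emg_Poa, time_to_emg, n) %>%
  bind_rows(data.unemerged) %>%
  arrange(ID, time_to_emg) %>%
  uncount(n)

result
```

For the example data this should give 50 rows (25 per pot) with the columns ID, Emg_Poa and time_to_emg. uncount() will fail if a cumulative count ever decreases between surveys, since that produces a negative n. Most survival routines also expect a censoring time rather than NA for seeds that did not emerge; the NA here simply mirrors the requested output.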