Grouping irregular time series data by sentinel events

Posted on 2025-01-15 02:11:48

I have a df of several hundred patients with a lab value measured on various dates. The labs were drawn at irregular intervals. I want to see whether the value rises to a critical point (for example, 1.5x the baseline of prior values) and mark that as an event. Lab values taken within 7 days of each other need to be grouped together as an "episode", and if an event occurs within an episode, the entire episode should be flagged. Within flagged episodes I need to mark when the value is above 1.2x its baseline; the points where the value rises above or falls back to 1.2x baseline will mark the true start and end of the event. If two episodes occur within 7 days of each other, all values in and between the two episodes should be marked as a single episode. My end goal is to count the number of "episodes" per person and eventually exclude the values within those episodes from later analysis.

I am able to use dplyr/mutate to make a new column to flag the individual sentinel events, but I'm having trouble figuring out how to group the rows into 7-day episodes.

Any help would be appreciated!

Example data (dput output, assigned to df):

df <- structure(list(
  ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  lab = c(0.9, 1.2, 1.4, 1, 1, 1.4, 1.6, 1.9, 1.5, 1.4, 1.4,
          1.6, 1.8, 1.5, 1.3, 1.3, 1, 1, 1.7, 1.4, 1.4, 1.6),
  date = structure(c(13375, 13378, 13380, 13382, 13386, 16559, 16630, 16633,
                     16648, 17065, 17070, 17091, 17093, 17096, 17172, 17225,
                     17871, 18033, 18158, 18162, 18278, 18635), class = "Date")),
  row.names = c(NA, -22L),
  class = c("tbl_df", "tbl", "data.frame"))

三人与歌 2025-01-22 02:11:49

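Only a partial answer, because your problem is really complex. I'll explain how to group your dates into "episodes". My suggested approach is hierarchical clustering. With the "complete" linkage method and the cutoff set to 7, any two dates within an episode are at most 7 days apart, so an episode never spans more than 7 days. If, instead, you want an episode to include every date that lies within 7 days of at least one other member of the same cluster (so that episodes can chain together and span more than 7 days), you should consider the "single" linkage method. Note that the code below uses each patient's first lab value as the baseline for the 1.5x event threshold, which is a simplification.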

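library(dplyr)

df |>
  group_by(ID) |>
  # Event: value exceeds 1.5x the patient's first lab value (used as baseline)
  mutate(event = lab > lab[1] * 1.5) |>
  # Episode: cluster each patient's dates, cutting the tree at 7 days
  mutate(episode = {
    dist(date) |>
      hclust(method = "complete") |>
      cutree(h = 7)
  }) |>
  group_by(ID, episode) |>
  # Flag the whole episode if any of its rows is an event
  mutate(flag = any(event)) |>
  as.data.frame()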


##>    ID lab       date event episode  flag
##> 1   1 0.9 2006-08-15 FALSE       1  TRUE
##> 2   1 1.2 2006-08-18 FALSE       1  TRUE
##> 3   1 1.4 2006-08-20  TRUE       1  TRUE
##> 4   1 1.0 2006-08-22 FALSE       1  TRUE
##> 5   1 1.0 2006-08-26 FALSE       2 FALSE
##> 6   2 1.4 2015-05-04 FALSE       1 FALSE
##> 7   2 1.6 2015-07-14 FALSE       2 FALSE
##> 8   2 1.9 2015-07-17 FALSE       2 FALSE
##> 9   2 1.5 2015-08-01 FALSE       3 FALSE
##> 10  2 1.4 2016-09-21 FALSE       4 FALSE
##> 11  2 1.4 2016-09-26 FALSE       4 FALSE
##> 12  2 1.6 2016-10-17 FALSE       5 FALSE
##> 13  2 1.8 2016-10-19 FALSE       5 FALSE
##> 14  2 1.5 2016-10-22 FALSE       5 FALSE
##> 15  2 1.3 2017-01-06 FALSE       6 FALSE
##> 16  2 1.3 2017-02-28 FALSE       7 FALSE
##> 17  3 1.0 2018-12-06 FALSE       1 FALSE
##> 18  3 1.0 2019-05-17 FALSE       2 FALSE
##> 19  3 1.7 2019-09-19  TRUE       3  TRUE
##> 20  3 1.4 2019-09-23 FALSE       3  TRUE
##> 21  3 1.4 2020-01-17 FALSE       4 FALSE
##> 22  3 1.6 2021-01-08  TRUE       5  TRUE
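
The "single" linkage variant mentioned above may be closer to what the question describes, since it lets nearby episodes chain into one. For dates sorted within each patient, cutting a single-linkage tree at 7 days should give the same groups as starting a new episode whenever the gap to the previous draw exceeds 7 days, so a sketch can avoid hclust altogether. The names gap, res, counts, n_episodes and n_flagged are made up for this illustration, and the baseline is still assumed to be each patient's first value, as in the code above.

```r
library(dplyr)

# Same example data as in the question, rebuilt as a plain data frame.
df <- data.frame(
  ID  = rep(c(1, 2, 3), times = c(5, 11, 6)),
  lab = c(0.9, 1.2, 1.4, 1, 1, 1.4, 1.6, 1.9, 1.5, 1.4, 1.4,
          1.6, 1.8, 1.5, 1.3, 1.3, 1, 1, 1.7, 1.4, 1.4, 1.6),
  date = as.Date(c(13375, 13378, 13380, 13382, 13386, 16559, 16630, 16633,
                   16648, 17065, 17070, 17091, 17093, 17096, 17172, 17225,
                   17871, 18033, 18158, 18162, 18278, 18635),
                 origin = "1970-01-01")
)

res <- df |>
  arrange(ID, date) |>
  group_by(ID) |>
  mutate(
    # Assumption: baseline is the patient's first value, as in the answer.
    event = lab > first(lab) * 1.5,
    # Days since the previous draw; NA for each patient's first row.
    gap = as.numeric(date - lag(date)),
    # A new episode starts at the first row and after any gap above 7 days.
    episode = cumsum(is.na(gap) | gap > 7)
  ) |>
  group_by(ID, episode) |>
  # Flag the whole episode if any of its rows is an event.
  mutate(flag = any(event)) |>
  ungroup()

# Number of episodes and of flagged episodes per patient.
counts <- res |>
  distinct(ID, episode, flag) |>
  group_by(ID) |>
  summarise(n_episodes = n(), n_flagged = sum(flag))

print(counts)
```

With this grouping, all five rows of ID 1 fall into one episode (consecutive gaps of 3, 2, 2 and 4 days), whereas the "complete" method above splits off the 2006-08-26 row. Unflagged rows for later analysis can then be kept with filter(res, !flag).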
