Finding overlapping intervals

Posted 2024-12-15 21:10:36

Context: As far as I can see, R lacks consistent functions that facilitate data preparation for survival/event history analysis, e.g. episode splitting to include time-varying covariates (sometimes referred to as 'counting process data').

For each individual (id), the start time (start.cp) and end time (stop.cp) of each episode are given. Furthermore, for each of the 1, 2, ..., p time-varying covariates (TVCs), we know when its episode starts (tvc.start_) and when it ends (tvc.stop_).

In my example (see below) the number of TVCs is 2, but in general the number can vary (from 1 to p).

Example:

Input data:

  id start.cp stop.cp tvc.start1 tvc.start2 tvc.stop1 tvc.stop2
1  1        1       2          2          3         4         7
2  1        2       3          2          3         4         7
3  1        3       4          2          3         4         7
4  1        4       7          2          3         4         7
5  1        7      12          2          3         4         7

structure(list(id = c(1, 1, 1, 1, 1), start.cp = c(1, 2, 3, 4, 
7), stop.cp = c(2, 3, 4, 7, 12), tvc.start1 = c(2, 2, 2, 2, 2
), tvc.start2 = c(3, 3, 3, 3, 3), tvc.stop1 = c(4, 4, 4, 4, 4
), tvc.stop2 = c(7, 7, 7, 7, 7)), .Names = c("id", "start.cp", 
"stop.cp", "tvc.start1", "tvc.start2", "tvc.stop1", "tvc.stop2"), 
row.names = c(NA, 5L), class = "data.frame")

The names of the TVCs are known, i.e. in this example it is known that

tvc.start <- c("tvc.start1", "tvc.start2") 
tvc.stop <- c("tvc.stop1", "tvc.stop2")

Expected results:

  id start.cp stop.cp tvc.start1 tvc.start2 tvc.stop1 tvc.stop2 tvc.d1 tvc.d2
1  1        1       2          2          3         4         7      0      0
2  1        2       3          2          3         4         7      1      0
3  1        3       4          2          3         4         7      1      0
4  1        4       7          2          3         4         7      0      1
5  1        7      12          2          3         4         7      0      1

structure(list(id = c(1, 1, 1, 1, 1), start.cp = c(1, 2, 3, 4, 
7), stop.cp = c(2, 3, 4, 7, 12), tvc.start1 = c(2, 2, 2, 2, 2
), tvc.start2 = c(3, 3, 3, 3, 3), tvc.stop1 = c(4, 4, 4, 4, 4
), tvc.stop2 = c(7, 7, 7, 7, 7), tvc.d1 = c(0, 1, 1, 0, 0), tvc.d2 = c(0, 
0, 0, 1, 1)), .Names = c("id", "start.cp", "stop.cp", "tvc.start1", 
"tvc.start2", "tvc.stop1", "tvc.stop2", "tvc.d1", "tvc.d2"), row.names = c(NA, 
5L), class = "data.frame")

Question: For each TVC, I would like to create a new vector (tvc.d1, tvc.d2, see example) which indicates whether a given episode (defined by start.cp and stop.cp) overlaps (=1) the interval of that TVC. Episodes are assumed to be half-open intervals [start.cp, stop.cp). How can this be done without looping over the set of TVCs? In other words, I am looking for a vectorized solution.

P.S.: Please feel free to change the title...

泪是无色的血 2024-12-22 21:10:36

I think Terry Therneau might want to dispute your claim. The tcut and pyears functions in the recommended survival package are described early in his technical article with Cindy Crowson on handling time-dependent covariates. I had trouble understanding why tvc.d1 should be contributing exposure during the interval 2 -> 3 when its stop time was 2. But the explanation for later readers is in the comments to the question.

You really only need the start.cp and stop.cp vectors and the first row as input data. You compare the vector of episode break points to each TVC's (start, stop) pair and find the break points whose interval index == 1. I'm wondering whether the data really comes in this form; you might not need to duplicate the TVC start and stop times in your setup.

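# Episode break points: the first start time followed by every stop time
tvec <- with(dat, c(start.cp[1], stop.cp))
n <- nrow(dat)   # number of episodes; the last break point is dropped below
# findInterval() returns 1 for break points inside [tvc.start, tvc.stop)
dat$tvc.d1 <- 1*( findInterval(tvec,      # the "1*" converts to numeric
                               as.numeric( dat[ 1, c("tvc.start1", "tvc.stop1")]) ,  
                               all.inside=FALSE)[seq_len(n)] == 1)
dat$tvc.d2 <- 1*( findInterval(tvec, 
                               as.numeric( dat[ 1, c("tvc.start2", "tvc.stop2")]) ,  
                               all.inside=FALSE)[seq_len(n)] == 1)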
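
The findInterval code above still needs one assignment per TVC. As a rough sketch of how all p TVCs could be handled in one step, the comparison can be done on matrices instead. This assumes the same rule the findInterval code implements (an episode is flagged when its start.cp lies in [tvc.start, tvc.stop)), and it assumes the output columns are named by replacing "start" with "d" in the names held in tvc.start; neither assumption is confirmed by the question.

```r
# Example data from the question
dat <- data.frame(
  id = rep(1, 5),
  start.cp = c(1, 2, 3, 4, 7),
  stop.cp  = c(2, 3, 4, 7, 12),
  tvc.start1 = 2, tvc.start2 = 3,
  tvc.stop1  = 4, tvc.stop2  = 7
)
tvc.start <- c("tvc.start1", "tvc.start2")
tvc.stop  <- c("tvc.stop1", "tvc.stop2")

# n x p matrices of TVC bounds, one column per TVC
s <- as.matrix(dat[tvc.start])
e <- as.matrix(dat[tvc.stop])

# start.cp is recycled down each column, so every TVC is compared at once.
# Assumed rule: the episode's start time falls in [tvc.start, tvc.stop).
d <- 1 * (dat$start.cp >= s & dat$start.cp < e)

# Assumed naming scheme: tvc.start1 -> tvc.d1, tvc.start2 -> tvc.d2, ...
colnames(d) <- sub("start", "d", tvc.start, fixed = TRUE)
dat <- cbind(dat, d)
```

Under this rule tvc.d1 matches the expected table, while tvc.d2 comes out as 0, 0, 1, 1, 0, which is what the findInterval code returns but not what the question's expected results show. If a different overlap definition is intended, only the comparison line would need to change.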
