How can I splice an existing date-bounded row of data into two new rows, based on the date of a new variable?

Posted on 2025-01-16 13:53:44

In my longitudinal data set, each row represents a time period of observation for each person, and each row is bounded by a start and end date. The rows are numbered ('episode'), and contain many row-specific variables (eg, 'edu_level') that I need to retain throughout the following steps.

I created a new date variable, hx_start, which can relate to the start and end date of each row of data in 1 of 3 ways (below). For each scenario, I need to edit (splice) the existing row of data accordingly, using dplyr:

1. Between a given row's start and end date (ie, as it does for persons 2 and 4)
In this case, I want to splice the existing row into two new ones, so that the date of
hx_start is the start date of one of the rows. The other row would retain the original row's
start date and its end date would be one day before the date of hx_start.

2. On the same date as someone's row start date (ie, person 1)
In this case, no change is needed.

3. On the same date as someone's row end date (ie, person 3)
Same as #1: I need to splice the existing row into two new ones, so that the date of hx_start
is the start date of one of the rows. The other row would retain the original row's
start date and its end date would be one day before the date of hx_start.

So far, I have created a new data set that has 2 duplicates of each row, assuming that I will need to edit up to 2 rows per existing row, and then drop the originals (or retain only the original, in the case of person 1). Importantly, I need a way to carry forward all of the other variables from the original row to all new rows without naming them all, if possible (there are many in my real data set).

#Load packages
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

#Create data set
person <- c(1, 2, 3, 4)
episode <- c(33, 50, 65, 70)
start <- c('2013-01-01', '2010-01-21', '2009-09-18', '2010-05-26')
end <- c('2013-06-04', '2010-06-19', '2009-12-31', '2010-12-24')
hx_start <- c('2013-01-01', '2010-03-09', '2009-12-31', '2010-07-04')
edu_level <- c(2, 3, 2, 1)

#Populate data frame (data.frame() keeps the numeric columns numeric;
#cbind() would coerce every column to character)
d <- data.frame(person, episode, start, hx_start, end, edu_level)
#Format dates and add to data frame
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$hx_start <- as.Date(hx_start, format = '%Y-%m-%d')

#Create 2 duplicates of each row (3 copies per person in total)
d1 <- d[rep(seq_len(nrow(d)), each = 3), ]

d1
#>     person episode      start   hx_start        end edu_level
#> 1        1      33 2013-01-01 2013-01-01 2013-06-04         2
#> 1.1      1      33 2013-01-01 2013-01-01 2013-06-04         2
#> 1.2      1      33 2013-01-01 2013-01-01 2013-06-04         2
#> 2        2      50 2010-01-21 2010-03-09 2010-06-19         3
#> 2.1      2      50 2010-01-21 2010-03-09 2010-06-19         3
#> 2.2      2      50 2010-01-21 2010-03-09 2010-06-19         3
#> 3        3      65 2009-09-18 2009-12-31 2009-12-31         2
#> 3.1      3      65 2009-09-18 2009-12-31 2009-12-31         2
#> 3.2      3      65 2009-09-18 2009-12-31 2009-12-31         2
#> 4        4      70 2010-05-26 2010-07-04 2010-12-24         1
#> 4.1      4      70 2010-05-26 2010-07-04 2010-12-24         1
#> 4.2      4      70 2010-05-26 2010-07-04 2010-12-24         1

Created on 2022-03-23 by the reprex package (v2.0.0)
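
To make the three scenarios concrete, here is a minimal sketch that labels each row by how hx_start relates to its start and end dates. The column name `scenario` and its labels are made up for illustration, and the "outside" label covers a situation the question does not describe.

```r
library(dplyr)

# Same example data as in the question, built with data.frame()
d <- data.frame(
  person    = c(1, 2, 3, 4),
  episode   = c(33, 50, 65, 70),
  start     = as.Date(c("2013-01-01", "2010-01-21", "2009-09-18", "2010-05-26")),
  hx_start  = as.Date(c("2013-01-01", "2010-03-09", "2009-12-31", "2010-07-04")),
  end       = as.Date(c("2013-06-04", "2010-06-19", "2009-12-31", "2010-12-24")),
  edu_level = c(2, 3, 2, 1)
)

# `scenario` is a hypothetical helper column, not part of the original data
classified <- d %>%
  mutate(scenario = case_when(
    hx_start == start                 ~ "at_start", # scenario 2: no change
    hx_start == end                   ~ "at_end",   # scenario 3: split
    hx_start > start & hx_start < end ~ "between",  # scenario 1: split
    TRUE                              ~ "outside"   # assumption: not covered above
  ))

classified[, c("person", "start", "hx_start", "end", "scenario")]
```

With the example data, persons 2 and 4 fall under "between", person 1 under "at_start" and person 3 under "at_end".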

凉墨 2025-01-23 13:53:44

You can do this by creating a small helper function. I've done this using data.table:

library(data.table)

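# s = start, m = hx_start, e = end (all Dates) for a single row
f <- function(s, m, e) {
  # hx_start after start: split into [m, e] and [s, m - 1]
  if (m > s) return(list("start" = c(m, s), "hx_start" = c(m, m), "end" = c(e, m - 1)))
  # hx_start equal to start: keep the row unchanged
  if (m == s) return(list("start" = s, "hx_start" = m, "end" = e))
}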

setDT(d)[,!c(3:5)][d[ ,f(start,hx_start,end), by=person], on=.(person)]

Output:

   person episode edu_level      start   hx_start        end
1:      1      33         2 2013-01-01 2013-01-01 2013-06-04
2:      2      50         3 2010-03-09 2010-03-09 2010-06-19
3:      2      50         3 2010-01-21 2010-03-09 2010-03-08
4:      3      65         2 2009-12-31 2009-12-31 2009-12-31
5:      3      65         2 2009-09-18 2009-12-31 2009-12-30
6:      4      70         1 2010-07-04 2010-07-04 2010-12-24
7:      4      70         1 2010-05-26 2010-07-04 2010-07-03

Notice that:

For persons 2 and 4, one row now has hx_start as its start date; the other row keeps the original start date and ends one day before hx_start.
For person 1, nothing has changed.
For person 3, the same split applies: one row starts (and ends) on hx_start, and the other keeps the original start date and ends one day before hx_start.

Tidyverse option (also uses function above)

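library(dplyr)
library(tidyr)

inner_join(
  # all columns except the three date columns
  d %>% select(-c(start, hx_start, end)),
  # one or two rows of dates per person, built with f()
  d %>%
    rowwise() %>%
    summarize(person = max(person),
              dates = list(f(start, hx_start, end))) %>%
    unnest_wider(dates) %>%
    unnest(cols = everything()),
  by = "person"
)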

Output:

   person episode edu_level      start   hx_start        end
1:      1      33         2 2013-01-01 2013-01-01 2013-06-04
2:      2      50         3 2010-03-09 2010-03-09 2010-06-19
3:      2      50         3 2010-01-21 2010-03-09 2010-03-08
4:      3      65         2 2009-12-31 2009-12-31 2009-12-31
5:      3      65         2 2009-09-18 2009-12-31 2009-12-30
6:      4      70         1 2010-07-04 2010-07-04 2010-12-24
7:      4      70         1 2010-05-26 2010-07-04 2010-07-03
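
If you would rather avoid the helper function and the join, the same split can be sketched with plain dplyr verbs. This is only a sketch under one assumption: a row is split when hx_start falls after start and on or before end, and any other row is left as it is. The names `needs_split`, `before`, `after` and `result` are made up for illustration. Because the rows are only filtered and mutated, every other column (edu_level and so on) is carried along without being named.

```r
library(dplyr)

# Same example data as in the question, built with data.frame()
d <- data.frame(
  person    = c(1, 2, 3, 4),
  episode   = c(33, 50, 65, 70),
  start     = as.Date(c("2013-01-01", "2010-01-21", "2009-09-18", "2010-05-26")),
  hx_start  = as.Date(c("2013-01-01", "2010-03-09", "2009-12-31", "2010-07-04")),
  end       = as.Date(c("2013-06-04", "2010-06-19", "2009-12-31", "2010-12-24")),
  edu_level = c(2, 3, 2, 1)
)

# Hypothetical flag: TRUE when hx_start lies after start and no later than end
flagged <- d %>% mutate(needs_split = hx_start > start & hx_start <= end)

# First piece of a split row: original start, ends the day before hx_start
before <- flagged %>%
  filter(needs_split) %>%
  mutate(end = hx_start - 1)

# Second piece (or the untouched row): starts on hx_start only if the row is split
after <- flagged %>%
  mutate(start = if_else(needs_split, hx_start, start))

result <- bind_rows(before, after) %>%
  select(-needs_split) %>%
  arrange(person, start)

result
```

The rows come out sorted by person and start date, so the order differs from the output above, but the date ranges should be the same.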
