在 R 中,使用日期范围内的事件计数创建变量

发布于 2025-01-10 15:00:41 字数 1774 浏览 0 评论 0原文

背景

我有一个R数据框d

d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
                event = c(1,1,0,0,1,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                stringsAsFactors=FALSE)

如您所见,它的ID 列,他们是否发生过事件,以及记录其事件状态的日期 (event_date)。

问题

我想创建一个新的变量/列event_within_interval,它将1分配给给定所有单元格ID(如果该 ID第一次事件发生后 180 天内发生 2 个或更多事件=1 =1。

让我进一步解释一下:ID=aID=c 各有 2 个或更多事件,但只有 ID=c 有其 第二次事件发生在第一次事件发生后的 180 天内(因此,对于 ID=c 来说,2013 年 4 月 7 日 - 2013 年 3 月 14 日 = 24 天)。

问题是我不确定如何告诉 R “如果第二个事件发生在第一个 event=1 后 180 天内”。

我想要什么

这是我正在寻找的:

want <- data.frame(ID = c("a","a","b","b","c","c","c"),
                event = c(1,1,1,0,0,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                event_within_interval = c(0,0,0,0,1,1,1),
                stringsAsFactors=FALSE)

我尝试过的

到目前为止我才刚刚开始尝试:

d <- d %>% 
  mutate(event_within_interval = ID %in% if_else(d$event == 1, 1, 0))

但这并没有给我我想要的,如果你运行代码你就可以知道。

我已将其设置为 if_else,但我不确定从这里该去哪里。

更新:我编辑了两个可重现的示例(我已经得到的和我想要的),以强调所需的日期间隔需要在第一个事件和第二个事件之间,而不是在第一个事件和最后一个事件。 (一些用户使用 last 提交了示例,该示例适用于可重现示例的上一次迭代,但不适用于真实数据集。)

Background

I've got an R dataframe d:

d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
                event = c(1,1,0,0,1,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                stringsAsFactors=FALSE)

As you can see, it's got 3 distinct people in the ID column, and they've either had or not had an event, along with a date their event status was recorded (event_date).

The Problem

I'd like to create a new variable / column, event_within_interval, which assigns 1 to all the cells of a given ID if that ID has 2 or more event=1 within 180 days of their first event=1.

Let me explain further: both ID=a and ID=c have 2 or more events each, but only ID=c has their second event within 180 days of their first (so here, the 4/7/2013 - 3/14/2013 = 24 days for ID=c).

The problem is that I'm not sure how to tell R this idea of "if the second happens within 180 days of the first event=1".

What I'd like

Here's what I'm looking for:

want <- data.frame(ID = c("a","a","b","b","c","c","c"),
                event = c(1,1,1,0,0,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                event_within_interval = c(0,0,0,0,1,1,1),
                stringsAsFactors=FALSE)

What I've tried

I've only got the beginnings of an attempt thus far:

d <- d %>% 
  mutate(event_within_interval = ID %in% if_else(d$event == 1, 1, 0))

But this doesn't give me what I'd like, as you can tell if you run the code.

I've set the thing up as an if_else, but I'm not sure where to go from here.

UPDATE: I've edited both reproducible examples (what I've got and what I want) to emphasize the fact that the desired date interval needs to be between the first event and the second event, not the first event and the last event. (A couple of users submitted examples using last, which worked for the previous iteration of the reproducible example but wouldn't have worked on the real dataset.)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

柠檬心 2025-01-17 15:00:41

那么 lubridatedata.table 包怎么样?

library(data.table)
library(lubridate)

d <- data.frame(ID = c("a","a","b","b", "c","c"),
                event = c(1,1,0,0,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07")),
                stringsAsFactors=FALSE)

d <-  data.table(d)

d <- d[, event_within_interval := 0]

timeInterval <- interval(start = "2013-03-14", end = "2013-04-07")

d <- d[event == 1 & event_date %within% timeInterval, event_within_interval := 1]

d
#    ID event event_date event_within_interval
# 1:  a     1 2011-01-01                     0
# 2:  a     1 2012-08-21                     0
# 3:  b     0 2011-12-23                     0
# 4:  b     0 2011-12-31                     0
# 5:  c     1 2013-03-14                     1
# 6:  c     1 2013-04-07                     1

What about by packages lubridate and data.table?

library(data.table)
library(lubridate)

d <- data.frame(ID = c("a","a","b","b", "c","c"),
                event = c(1,1,0,0,1,1),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07")),
                stringsAsFactors=FALSE)

d <-  data.table(d)

d <- d[, event_within_interval := 0]

timeInterval <- interval(start = "2013-03-14", end = "2013-04-07")

d <- d[event == 1 & event_date %within% timeInterval, event_within_interval := 1]

d
#    ID event event_date event_within_interval
# 1:  a     1 2011-01-01                     0
# 2:  a     1 2012-08-21                     0
# 3:  b     0 2011-12-23                     0
# 4:  b     0 2011-12-31                     0
# 5:  c     1 2013-03-14                     1
# 6:  c     1 2013-04-07                     1
橘味果▽酱 2025-01-17 15:00:41

这很有趣。

场景 1

我的方法是

  • ID 对事件进行分组
  • 在当前日期和初始日期之间的两天内应用第一个条件检查
  • ,检查事件总和是否大于或等于 2:sum (event) >= 2
  • 仅当满足两个条件时,我才会为该事件返回一个

为了便于阅读,我已将数据中的条件值返回为 test_* 变量。


  d %>%
    group_by(ID) %>%
    mutate(test_interval = event_date - min(event_date) < 180,
           test_sum_events = sum(event) >= 2,
           event_within_interval = if_else(test_interval & test_sum_events, 
                                           1, 0)) %>%
    ungroup()

场景 2

在此场景中,数据按 ID 内的 event_date 排序,并且第一个事件和第二个事件之间的差异必须在 180 天以内。其余事件将被忽略。

d %>%
    group_by(ID) %>%
    arrange(event_date) %>%
    mutate(
        # Check the difference between first event: min(event_date) and
        # second event: event_date[2]
        test_interval_first_two = event_date[2] - min(event_date) <= 180,
        test_sum_events = sum(event) >= 2,
        event_within_interval = if_else(
          test_interval_first_two & test_sum_events, 1, 0)
    ) %>%
    ungroup()

This is good fun.

Scenario 1

My approach would be to

  • group events by ID
  • Apply first condition check on two the span of days between current date and initial date
  • check if the sum of events is bigger or equal two: sum(event) >= 2
  • only if the two conditions are met I would return one for the event

For readability, I've returned values of conditions in the data as test_* variables.


  d %>%
    group_by(ID) %>%
    mutate(test_interval = event_date - min(event_date) < 180,
           test_sum_events = sum(event) >= 2,
           event_within_interval = if_else(test_interval & test_sum_events, 
                                           1, 0)) %>%
    ungroup()

Scenario 2

In this scenario, the data is sorted by event_date within ID and the difference between the first event and second event has to be under 180 days. Rest of events is ignored.

d %>%
    group_by(ID) %>%
    arrange(event_date) %>%
    mutate(
        # Check the difference between first event: min(event_date) and
        # second event: event_date[2]
        test_interval_first_two = event_date[2] - min(event_date) <= 180,
        test_sum_events = sum(event) >= 2,
        event_within_interval = if_else(
          test_interval_first_two & test_sum_events, 1, 0)
    ) %>%
    ungroup()
如梦亦如幻 2025-01-17 15:00:41

您可以先对ID列进行group_by,以便我们可以计算同一ID内的天数。然后在 if_else 语句中的条件中,使用条件 with sum() > 1 AND 天差<= 180

在这里,我假设每个 ID 只有两个“事件”或行。

library(dplyr)

d %>% 
  group_by(ID) %>% 
  mutate(event_within_interval = if_else(sum(event) > 1 & last(event_date) - first(event_date) <= 180, 1L, 0L))

# A tibble: 6 x 4
# Groups:   ID [3]
  ID    event event_date event_within_interval
  <chr> <dbl> <date>                     <int>
1 a         1 2011-01-01                     0
2 a         1 2012-08-21                     0
3 b         0 2011-12-23                     0
4 b         0 2011-12-31                     0
5 c         1 2013-03-14                     1
6 c         1 2013-04-07                     1

You can first group_by the ID column, so that we can calculate days within the same ID. Then in the condition in the if_else statement, use condition with sum() > 1 AND day difference <= 180.

Here I assume there's only two "events" or rows per ID.

library(dplyr)

d %>% 
  group_by(ID) %>% 
  mutate(event_within_interval = if_else(sum(event) > 1 & last(event_date) - first(event_date) <= 180, 1L, 0L))

# A tibble: 6 x 4
# Groups:   ID [3]
  ID    event event_date event_within_interval
  <chr> <dbl> <date>                     <int>
1 a         1 2011-01-01                     0
2 a         1 2012-08-21                     0
3 b         0 2011-12-23                     0
4 b         0 2011-12-31                     0
5 c         1 2013-03-14                     1
6 c         1 2013-04-07                     1
萌逼全场 2025-01-17 15:00:41

我们可以这样做。在此示例中,使用附加列 interval 来查看间隔,然后使用 ifelse 语句。

library(dpylr)

d %>% 
  group_by(ID) %>% 
  mutate(interval = last(event_date)- first(event_date),
         event_within_interval = ifelse(event == 1 &
                                          interval < 180, 1, 0))
  ID    event event_date interval event_within_interval
  <chr> <dbl> <date>     <drtn>                   <dbl>
1 a         1 2011-01-01 598 days                     0
2 a         1 2012-08-21 598 days                     0
3 b         0 2011-12-23   8 days                     0
4 b         0 2011-12-31   8 days                     0
5 c         1 2013-03-14  24 days                     1
6 c         1 2013-04-07  24 days                     1

Here is how we could do it. In this example with an additional column interval to see the interval and then use an ifelse statement.

library(dpylr)

d %>% 
  group_by(ID) %>% 
  mutate(interval = last(event_date)- first(event_date),
         event_within_interval = ifelse(event == 1 &
                                          interval < 180, 1, 0))
  ID    event event_date interval event_within_interval
  <chr> <dbl> <date>     <drtn>                   <dbl>
1 a         1 2011-01-01 598 days                     0
2 a         1 2012-08-21 598 days                     0
3 b         0 2011-12-23   8 days                     0
4 b         0 2011-12-31   8 days                     0
5 c         1 2013-03-14  24 days                     1
6 c         1 2013-04-07  24 days                     1
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文