r dplyr根据重叠的日期时间间隔加入表
我有两个表,label
其中包含 label
时间间隔(开始和结束),以及 items
每个表都有唯一的时间戳项目
。
我想根据时间戳来映射items
。如果项目时间戳落在标签
的时间间隔内,则该项目
属于该标签
。
例如,这两个项
a123
、b682
位于与label
X547<对应的时间间隔内/code>,所以
a123
、b682
属于X547
# Label, start - end time
have_label_start_end <- data.frame(label = c('X547', 'X285', 'X290')
, time = c(lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 9L, min = 58L, sec = 51, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 4L, sec = 54, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 8L, sec = 34, tz = "UTC"))) %>%
dplyr::mutate(start_time = time, stop_time = lead(time)) %>% dplyr::select(-time)
> have_label_start_end
label start_time stop_time
1 X547 2022-04-11 09:58:51 2022-04-11 10:04:54
2 X285 2022-04-11 10:04:54 2022-04-11 11:08:34
3 X290 2022-04-11 11:08:34 <NA>
# Item
have_item_time <- data.frame(item = c('a123', 'b682', 'c3324', 'd4343', 'e5343')
, timestamp = c(lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 9L, min = 59L, sec = 34, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 3L, sec = 13, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 5L, sec = 17, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 8L, sec = 35, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 10L, sec = 09, tz = "UTC")))
> have_item_time
item timestamp
1 a123 2022-04-11 09:59:34
2 b682 2022-04-11 10:03:13
3 c3324 2022-04-11 10:05:17
4 d4343 2022-04-11 11:08:35
5 e5343 2022-04-11 11:10:09
# Map Items to Label
want <- data.frame(label = c('X547', 'X547','X285', 'X290', 'X290'),
item = c('a123', 'b682', 'c3324', 'd4343', 'e5343')
, timestamp = c(lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 9L, min = 59L, sec = 34, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 3L, sec = 13, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 5L, sec = 17, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 8L, sec = 35, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 10L, sec = 09, tz = "UTC")))
> want
label item timestamp
1 X547 a123 2022-04-11 09:59:34
2 X547 b682 2022-04-11 10:03:13
3 X285 c3324 2022-04-11 10:05:17
4 X290 d4343 2022-04-11 11:08:35
5 X290 e5343 2022-04-11 11:10:09
I have two tables, the label
which contains the label
time interval (start and end), and the items
which have a unique timestamp for each item
.
I want to map the items
based on their timestamps. If the item timestamp falls within the label
's time interval, therefore the item
belongs to that label
.
For example these two items
a123
, b682
are in the time interval which corresponds to label
X547
, so a123
, b682
belongs to X547
# Label, start - end time
have_label_start_end <- data.frame(label = c('X547', 'X285', 'X290')
, time = c(lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 9L, min = 58L, sec = 51, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 4L, sec = 54, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 8L, sec = 34, tz = "UTC"))) %>%
dplyr::mutate(start_time = time, stop_time = lead(time)) %>% dplyr::select(-time)
> have_label_start_end
label start_time stop_time
1 X547 2022-04-11 09:58:51 2022-04-11 10:04:54
2 X285 2022-04-11 10:04:54 2022-04-11 11:08:34
3 X290 2022-04-11 11:08:34 <NA>
# Item
have_item_time <- data.frame(item = c('a123', 'b682', 'c3324', 'd4343', 'e5343')
, timestamp = c(lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 9L, min = 59L, sec = 34, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 3L, sec = 13, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 5L, sec = 17, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 8L, sec = 35, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 10L, sec = 09, tz = "UTC")))
> have_item_time
item timestamp
1 a123 2022-04-11 09:59:34
2 b682 2022-04-11 10:03:13
3 c3324 2022-04-11 10:05:17
4 d4343 2022-04-11 11:08:35
5 e5343 2022-04-11 11:10:09
# Map Items to Label
want <- data.frame(label = c('X547', 'X547','X285', 'X290', 'X290'),
item = c('a123', 'b682', 'c3324', 'd4343', 'e5343')
, timestamp = c(lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 9L, min = 59L, sec = 34, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 3L, sec = 13, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 10L, min = 5L, sec = 17, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 8L, sec = 35, tz = "UTC"),
lubridate::make_datetime(year = 2022L, month = 4L, day = 11L, hour = 11L, min = 10L, sec = 09, tz = "UTC")))
> want
label item timestamp
1 X547 a123 2022-04-11 09:59:34
2 X547 b682 2022-04-11 10:03:13
3 X285 c3324 2022-04-11 10:05:17
4 X290 d4343 2022-04-11 11:08:35
5 X290 e5343 2022-04-11 11:10:09
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
以下方法将间隔转换为开始行和结束行,添加 item_time 表,然后填充这些间隔中的标签以填充 item_time 表。
这假设每一项都在一个且仅有一个区间内。如果某个项目可能超出任何间隔或超过一个间隔,我们就需要对此进行修改。
另一种方法可能是使用 fuzzyjoin 包,它可以容纳像这样的“非等价”连接(就像
data.table
和sqldf
一样) 。一项调整是我们需要用有效的时间戳替换任何NA
stop_times。在这种情况下,我们可以使用have_item_time
中的最大值。在这里,我们获取
have_item_time
中的特定时间戳位于(包含)have_label_start_end
中的任何间隔之间的所有匹配项。Here's an approach that converts the intervals to a start and end row, adds the item_time table, and then fills the labels from those intervals to populate the item_time table.
This assumes every item is within one and only one interval. If it's possible that an item could be outside of any intervals, or in more than one, we'd need to revise this.
Another approach could be to use the
fuzzyjoin
package, which accommodates "non-equi" joins like this (as dodata.table
andsqldf
). One tweak is that we need to replace anyNA
stop_times with a valid timestamp. In this case we can use the max fromhave_item_time
.Here, we get all the matches where the specific timestamp in
have_item_time
is between (inclusive) any of the intervals inhave_label_start_end
.我们可以使用{powerjoin}:
We can use {powerjoin} :
现在,只有DPLYR才能实现这一目标,使用join_by()函数代替参数,
https://stackoverflow.com/a/a/73498674/3848482
获得所需的结果:
It's now possible with only dplyr to achieve that, using the join_by() function in place of the by argument
https://stackoverflow.com/a/73498674/3848482
To get the desired result :