I have 2 tables
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), as.Date("2020-1-10"), by = "days")))
df2 = data.frame("observations" = c("a", "b", "c", "d"), "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
I would like to know the number of observation periods that occur on each day of df1, based on the start/stop dates in df2. E.g. on 1/1/2020, observations a and b were in progress, hence "2".
The expected output would be as follows:

I've tried using sum():
df1$number = sum(as.Date(df2$start) <= df1$dates & as.Date(df2$end)>=df1$dates)
But that only returns a single sum over the entire column, repeated on every row.
I've then tried to create a custom function for this:
df1$number = apply(df1, 1, function(x) sum(df2$start <= x & df2$end>=x))
But it returns an NA value.
I then tried to embed an ifelse() within it, but got the same issue with NAs:
apply(df1, 1, function(x) sum(ifelse(df2$start <= x & df2$end>=x, 1, 0)))
Can anyone suggest what the issue is? Thanks!
Edit: an interval join was suggested, which is not what I'm trying to get. I think naming the observations with a numeric label was what caused the confusion. I am trying to find the TOTAL number of observations whose periods include each day, rather than doing a 1:1 match.
Regards
Sing
Comments (2)
Define the comparison in a function f and pass it through outer(); rowSums() is what you're looking for.
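The code for the outer()/rowSums() suggestion did not survive on this page, so here is a minimal sketch of how it might look, reusing df1 and df2 from the question. The body of f and the helper name active are assumptions, not necessarily what the answerer wrote. Here f works on row indices rather than on the dates themselves.

```r
df1 <- data.frame(dates = seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "days"))
df2 <- data.frame(
  observations = c("a", "b", "c", "d"),
  start = as.Date(c("2019-12-30", "2020-01-01", "2020-01-05", "2020-01-10")),
  end   = as.Date(c("2020-01-03", "2020-01-02", "2020-01-12", "2020-01-14"))
)

# Assumed comparison: i indexes rows of df1, j indexes rows of df2.
# outer() calls f once with two equal-length index vectors, so f must be vectorised.
f <- function(i, j) df1$dates[i] >= df2$start[j] & df1$dates[i] <= df2$end[j]

# 10 x 4 logical matrix: one row per day, one column per observation
active <- outer(seq_len(nrow(df1)), seq_len(nrow(df2)), f)

# Count the observations in progress on each day
df1$number <- rowSums(active)
df1
```

For the sample data this gives 2, 2, 1, 0, 1, 1, 1, 1, 1, 2 for 2020-01-01 through 2020-01-10, including the 0 on 2020-01-04 when no observation is in progress.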
Here is a potential solution using dplyr/tidyverse functions and the %within% function from the lubridate package. This approach is similar to "Left Join Subset of Column Based on Date Interval"; however, there are some important differences: use summarise() instead of filter() to avoid 'losing' dates where "number" == 0, and join by character() as there are no common columns between the datasets.
Created on 2022-06-27 by the reprex package (v2.0.1)
Does this approach work with your actual data?
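The reprex itself is missing from this page, so the following is only a sketch of what the dplyr/lubridate approach described above could look like, reusing df1 and df2 from the question. The intermediate column name in_progress and the result name are assumptions. Joining with by = character() builds every date/observation pair; newer dplyr versions (1.1.0 and later) offer cross_join() for the same purpose and may warn about by = character().

```r
library(dplyr)
library(lubridate)

df1 <- data.frame(dates = seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "days"))
df2 <- data.frame(
  observations = c("a", "b", "c", "d"),
  start = as.Date(c("2019-12-30", "2020-01-01", "2020-01-05", "2020-01-10")),
  end   = as.Date(c("2020-01-03", "2020-01-02", "2020-01-12", "2020-01-14"))
)

result <- df1 %>%
  # No common columns, so pair every date with every observation (10 x 4 = 40 rows)
  full_join(df2, by = character()) %>%
  # TRUE when the date falls inside the observation period, endpoints included
  mutate(in_progress = dates %within% interval(start, end)) %>%
  group_by(dates) %>%
  # summarise() rather than filter(), so days with no observation keep a count of 0
  summarise(number = sum(in_progress))

result
```

Because every date survives the join, 2020-01-04 stays in the result with number equal to 0 instead of being dropped.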