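How to compare two data frames with dates, return the matching dates within a specific interval, and flag the non-matching dates for each row in a new data frame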

Posted on 2025-01-19 14:08:05

I have a data frame with multiple measuring dates for each subject in each row, and another data frame with multiple visit dates for the same subject in each row (also including some NAs).

What I want is to extract the measuring dates that match the visit dates for a given subject within a specific interval (say +/- 10 days from the visit date), tag the measuring dates that do not fall within this interval (e.g., with 'FALSE' or -99), and keep the NAs as they are.

A similar question was asked here, but did not allow for measuring dates to be within an interval period from visit date.

set.seed(1)

# Dataframe with measure dates
df1 <- rbind.data.frame(sort(sample(seq(as.Date("2018-01-01"), as.Date("2019-01-01"), by = "day"), 10)),
                        c(sort(sample(seq(as.Date("2018-06-01"), as.Date("2019-06-01"), by = "day"), 8)), NA, NA),
                        c(sort(sample(seq(as.Date("2019-06-01"), as.Date("2020-06-01"), by = "day"), 6)), rep(NA, 4)))
names(df1) <- paste("MEASUREDATE", 1:10, sep = "")

myfun <- function(x) as.Date(x, format = "%Y-%m-%d", origin = "1970-01-01")
df1 <- data.frame(lapply(df1, myfun))
df1

# Dataframe with visit dates
df2 <- rbind.data.frame(as.numeric(df1[1, 2:7]), as.numeric(c(df1[2, 4:6], NA, NA, NA)), as.numeric(c(df1[3, 1:2], rep(NA, 4))))
df2 <- data.frame(lapply(df2, myfun))
names(df2) <- paste("VISIT", 1:6, sep = "")
df2

So the first row of the new data frame would look like this:

# New dataframe
df3 <- df1[1, ]
df3[1] <- FALSE
df3[9:10] <- FALSE
df3

Do you know how to tackle this problem? Any help is very much appreciated.

Comments (2)

颜漓半夏 2025-01-26 14:08:05

Here is a data.table solution. In the second-to-last line, measurement dates without a matching visit date are set to 1970-01-01 (NA is not an option, because it would be indistinguishable from the existing NAs, and the value has to be a date).
If the date format is not necessary, you can switch to character and use any value you like...

library(data.table)
# set as data.table
setDT(df1); setDT(df2)
# add subject numbering
df1[, id := .I]
df2[, id := .I]
# melt to long format
df1.melt <- melt(setDT(df1), id.vars = "id")
df2.melt <- melt(setDT(df2), id.vars = "id")
# add margins around visit dates
df2.melt[, `:=`(mindate = value - 10, maxdate = value + 10)][]
# join visitdays within 10 days of measure (non-equi join)
df1.melt[df2.melt, visitdate := i.value, on = .(id, value >= mindate, value <= maxdate)]
# set unmatched visitdates to 1970-01-01 (day 0; keeps the date format)
df1.melt[!is.na(value) & is.na(visitdate), visitdate := 0]
# last step is to cast to wide again
dcast(df1.melt, id ~ variable, value.var = "visitdate")

#    id MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
# 1:  1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-10-04   2018-10-04   2018-10-26   2018-10-26   1970-01-01    1970-01-01
# 2:  2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2019-01-03   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
# 3:  3   2019-08-28   2020-03-15   2020-03-15   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>
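
The character alternative mentioned in this answer could look roughly like the sketch below. It uses base R only and a small made-up vector rather than the df1/df2 from the question; the names measure, matched and flagged are hypothetical and only serve the illustration.

```r
# Toy data (hypothetical values, not the df1/df2 from the question)
measure <- as.Date(c("2018-01-05", "2018-05-09", NA))
# TRUE = within the interval, FALSE = outside, NA = no measurement date
matched <- c(FALSE, TRUE, NA)

# Convert the dates to character so that any flag value can share the column;
# missing measurement dates stay NA
flagged <- ifelse(is.na(measure), NA_character_,
                  ifelse(matched, format(measure), "FALSE"))
flagged
```

The price of this approach is that the column is no longer a Date, so date arithmetic on it needs a conversion back with as.Date() after dropping the flag values.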
喜爱纠缠 2025-01-26 14:08:05

As Wimpel said, you cannot have a logical and a Date in the same column. So I will use 1970-01-01 as the FALSE value.
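A small base R sketch of why a logical flag and a Date do not mix well, and why a Date sentinel keeps the column usable. The vectors d and keep are made-up values for illustration only.

```r
d <- as.Date(c("2018-01-05", "2018-05-09"))
keep <- c(FALSE, TRUE)

# Base ifelse() drops the Date class, so mixing in a logical flag
# leaves plain numbers (days since 1970-01-01) instead of dates
mixed <- ifelse(keep, d, FALSE)
class(mixed)

# Assigning a Date sentinel by index keeps the Date class
sentinel <- as.Date("1970-01-01")
tagged <- d
tagged[!keep] <- sentinel
class(tagged)
```

This is presumably also why the code in this answer uses dplyr's if_else() rather than base ifelse(): if_else() preserves the Date type of its inputs.
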

A solution using dplyr

library(dplyr)
# convert a row from a Date dataframe to a Date vector
convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# given a Date vector where columns 1:10 are measurement date and
# 11:16 visit dates, create a logical vector of length 10 where
# the value is TRUE if the corresponding measurement column
# is within 10 days of any of the visit dates
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, any))
}
# temporary dataframe of logical values for all checks on all dates
df_lgl <- cbind(df1,df2) %>% 
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>% 
  data.frame %>% 
  t
# create a result dataframe replacing logicals with corresponding dates
df3 <- df1
for(i in 1:ncol(df3)){ 
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
}

Output

> df3
  MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2         <NA>         <NA>         <NA>   2018-11-12   2018-12-30   2019-01-03         <NA>         <NA>         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16         <NA>         <NA>         <NA>         <NA>         <NA>         <NA>          <NA>

Some NA values appear because some visit dates are NA, so the check_within_10d function cannot determine whether one of the missing visit dates would have been within 10 days of a measurement date.
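
The behaviour comes from how any() treats NA, which the following sketch shows on made-up dates (measure and visits are hypothetical values, not taken from df1/df2):

```r
measure <- as.Date("2018-12-30")
visits  <- as.Date(c("2018-11-12", NA))

# Distance in days to each visit; the missing visit gives NA
close <- abs(as.numeric(measure - visits)) <= 10
close

# No known visit is close, but the unknown one might be, so the result is NA
any(close)

# Dropping the missing visits turns "unknown" into FALSE
any(close, na.rm = TRUE)
```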

If you want to ignore the missing visit dates in your check, use

convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# changed function to any(na.rm=TRUE)
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, function(x){any(x,na.rm=TRUE)}))
}
df_lgl <- cbind(df1,df2) %>% 
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>% 
  data.frame %>% 
  t
# restore NA where the measurement date itself is missing
df3 <- df1
for(i in 1:ncol(df3)){ 
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
  df3[,i] <- if_else(is.na(df1[,i]), df1[,i], df3[,i])
}

Output

> df3
  MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2018-12-30   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>
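
For comparison, the same per-subject check can be sketched in base R with outer(), which builds the full matrix of measurement-versus-visit differences in one call. This is only an illustrative variant, not the method used in either answer; the helper within_window and the example vectors are hypothetical.

```r
# TRUE if a measurement date lies within `window` days of any known visit,
# FALSE if not, NA if the measurement date itself is missing
within_window <- function(measure, visit, window = 10) {
  visit <- visit[!is.na(visit)]          # ignore missing visit dates
  if (length(visit) == 0) {
    return(ifelse(is.na(measure), NA, FALSE))
  }
  # rows = measurement dates, columns = visit dates
  diffs <- abs(outer(as.numeric(measure), as.numeric(visit), "-"))
  rowSums(diffs <= window) > 0
}

# One made-up subject
measure <- as.Date(c("2018-01-05", "2018-05-09", "2018-05-25", NA))
visit   <- as.Date(c("2018-05-12", NA))

hit <- within_window(measure, visit)

# Tag unmatched measurement dates with the 1970-01-01 sentinel, keep NA as is
tagged <- measure
tagged[!is.na(hit) & !hit] <- as.Date("1970-01-01")
tagged
```

To use it on the data frames from the question, the function would be called once per row, for example inside a loop over seq_len(nrow(df1)) with the row's measurement and visit dates unlisted into Date vectors.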