Apply a function to rows, but referencing a different table

Posted on 2025-02-11 06:25:53

I have two tables:

df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), as.Date("2020-1-10"), by = "days")))
df2 = data.frame("observations" = c("a", "b", "c", "d"), "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))

I would like to know how many observation periods are in progress on each day of df1, based on the start/end dates in df2. For example, on 2020-01-01, observations a and b were in progress, hence "2".

The expected output would be one such count for each date in df1.

I first tried using sum():

df1$number = sum(as.Date(df2$start) <= df1$dates & as.Date(df2$end)>=df1$dates)

But that only gives a single sum over the entire column.

I've then tried to create a custom function for this:

df1$number = apply(df1, 1, function(x) sum(df2$start <= x & df2$end>=x))

But it returns an NA value.

I then tried to embed an ifelse() within it, but got the same issue with NAs:

apply(df1, 1, function(x) sum(ifelse(df2$start <= x & df2$end>=x, 1, 0)))

Can anyone suggest what the issue is? Thanks!

Edit: an interval join was suggested, which is not what I'm trying to get. I think naming the observations with a numeric label was what caused the confusion. I am trying to find the TOTAL number of observations whose periods cover the day, rather than doing a 1:1 match.

Regards
Sing

空城缀染半城烟沙 2025-02-18 06:25:53

Define the comparison in a function f and pass it through outer(); rowSums() then gives the counts you're looking for.

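# f(x, y) is TRUE when date x of df1 falls within period y of df2
# (column 2 of df2 is start, column 3 is end); the \(x, y) shorthand needs R >= 4.1
f <- \(x, y) df1[x, 1] >= df2[y, 2] & df1[x, 1] <= df2[y, 3]
# outer() evaluates f for every (date, period) pair; rowSums() counts the matches per date
cbind(df1, number=rowSums(outer(1:nrow(df1), 1:nrow(df2), f)))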
#         dates number
# 1  2020-01-01      2
# 2  2020-01-02      2
# 3  2020-01-03      1
# 4  2020-01-04      0
# 5  2020-01-05      1
# 6  2020-01-06      1
# 7  2020-01-07      1
# 8  2020-01-08      1
# 9  2020-01-09      1
# 10 2020-01-10      2
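
For comparison, here is a minimal sketch of the per-date loop that the question was attempting. It is not the method from this answer: it sidesteps apply(), which coerces a data frame to a matrix before calling the function, by indexing the Date column directly so that each value is still a Date when it is compared. The helper name count_active is hypothetical, and the data frames are rebuilt from the question so the block runs on its own.

```r
df1 <- data.frame(dates = seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "days"))
df2 <- data.frame(
  observations = c("a", "b", "c", "d"),
  start = as.Date(c("2019-12-30", "2020-01-01", "2020-01-05", "2020-01-10")),
  end   = as.Date(c("2020-01-03", "2020-01-02", "2020-01-12", "2020-01-14"))
)

# Hypothetical helper: count the df2 periods that contain a single date d
# (both endpoints inclusive, as in the question)
count_active <- function(d) sum(df2$start <= d & df2$end >= d)

# Loop over row indices so that df1$dates[i] keeps its Date class;
# sum() of a logical vector is an integer, hence integer(1)
df1$number <- vapply(seq_len(nrow(df1)),
                     function(i) count_active(df1$dates[i]),
                     integer(1))
df1
```

On the example data this should reproduce the number column shown above. The outer() version avoids the explicit loop, while this form may be easier to adapt if the per-date rule becomes more involved.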

上课铃就是安魂曲 2025-02-18 06:25:53

Here is a potential solution using dplyr/tidyverse functions and the %within% function from the lubridate package. This approach is similar to the one in "Left Join Subset of Column Based on Date Interval", with two important differences: it uses summarise() instead of filter() to avoid 'losing' dates where number == 0, and it joins with by = character() because the datasets have no common columns:

library(dplyr)
library(lubridate)

df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), 
                                 as.Date("2020-1-10"),
                                 by = "days")))
df2 = data.frame("observations" = c("1", "2", "3", "4"), 
                 "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), 
                 "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
df1 %>%
  full_join(df2, by = character()) %>%
  mutate(number = dates %within% interval(start, end)) %>%
  group_by(dates) %>%
  summarise(number = sum(number))
#> # A tibble: 10 × 2
#>    dates      number
#>    <date>      <dbl>
#>  1 2020-01-01      2
#>  2 2020-01-02      2
#>  3 2020-01-03      1
#>  4 2020-01-04      0
#>  5 2020-01-05      1
#>  6 2020-01-06      1
#>  7 2020-01-07      1
#>  8 2020-01-08      1
#>  9 2020-01-09      1
#> 10 2020-01-10      2

Created on 2022-06-27 by the reprex package (v2.0.1)

Does this approach work with your actual data?
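
If loading dplyr and lubridate is not an option, the same cross-join-then-summarise idea can be sketched in base R. This is an illustrative variant rather than the code from this answer: it assumes merge() with by = NULL to build the Cartesian product and aggregate() to sum per date, and the names crossed, active and result are made up for the example. As a side note, dplyr 1.1.0 and later provide cross_join() as the replacement for by = character().

```r
df1 <- data.frame(dates = seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "days"))
df2 <- data.frame(
  observations = c("1", "2", "3", "4"),
  start = as.Date(c("2019-12-30", "2020-01-01", "2020-01-05", "2020-01-10")),
  end   = as.Date(c("2020-01-03", "2020-01-02", "2020-01-12", "2020-01-14"))
)

# by = NULL makes merge() return the Cartesian product:
# every date in df1 paired with every period in df2
crossed <- merge(df1, df2, by = NULL)

# TRUE where the date lies inside the period (inclusive at both ends)
crossed$active <- crossed$dates >= crossed$start & crossed$dates <= crossed$end

# Summing the logical column per date keeps dates with zero active periods,
# because no rows are filtered out before aggregating
result <- aggregate(active ~ dates, data = crossed, FUN = sum)
names(result)[2] <- "number"
result
```

The cross join grows as nrow(df1) * nrow(df2), so for large tables a per-date count or an interval-aware join may be preferable.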
