Efficiently calculating time interval overlap with data.table

Posted on 2025-01-25 02:05:00

I have a large data.table with ~1.5 million rows of start and end POSIXct times. For each row, I want to calculate the percentage of its start-end interval that overlaps with every other start-end interval in the data, and save a subset of the high-overlap rows to a separate table. This is straightforward with a for loop, but my current approach takes a very long time to run. In the code below, looping through just the first 1,000 rows takes ~20 seconds on my desktop; the main subsetting step (i.e., temp <- dt[end_time > focal$start_time & start_time < focal$end_time]) seems to account for the bulk of the processing. Is there another approach that would be faster?

library(data.table)
set.seed(123)
n <- 1500000
start_times <- as.POSIXct(runif(n, as.POSIXct("2000-01-01"), as.POSIXct("2010-01-01")), origin = "1970-01-01")
end_times <- start_times + as.double(runif(n)*1000000)
dt <- data.table(
  "id" = 1:n,
  "start_time" = start_times,
  "end_time" = end_times)
dt[, window_length_hours := as.double(end_time - start_time, units = "hours")]

start <- Sys.time()
choice_list <- data.table("focal_id" = character(), option_id = character(), overlap = double())
for(i in 1:1000){
  focal <- dt[i]
  temp <- dt[end_time > focal$start_time & start_time < focal$end_time]
  temp[, window_overlap_pct := (focal$window_length_hours - pmax(0, as.double(focal$end_time - end_time, units="hours")) - pmax(0, as.double(start_time - focal$start_time, units="hours")))/focal$window_length_hours]

  sample <- unique(temp[window_overlap_pct > .80][sample(.N, 2, replace = T)]) # save two rows that have high overlap
  
  choice_list <- rbindlist(list(choice_list,
                                 list("focal_id" = focal$id, "option_id" = sample$id, "overlap" = sample$window_overlap_pct),
                                 list("focal_id" = focal$id, "option_id" = focal$id, "overlap" = NA)))
}
end <- Sys.time()
end-start
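
For reference, the overlap fraction computed inside the loop is the length of the intersection between the focal interval and each candidate interval, divided by the focal interval's length. The small check below is only a sketch: it uses plain numeric hours and made-up toy values (not rows from the data above) to show that the "subtract the uncovered ends" form used in the loop agrees with the "clip, then subtract" form for intervals that overlap the focal one.

```r
# Toy intervals in plain numeric hours (hypothetical values, for illustration only)
focal_start <- 0
focal_end   <- 10
start <- c(-5, 2,  8,  1)
end   <- c( 4, 6, 15, 12)

focal_len <- focal_end - focal_start

# Form used in the loop: subtract the part of the focal window left uncovered at each end
pct_loop <- (focal_len - pmax(0, focal_end - end) - pmax(0, start - focal_start)) / focal_len

# Equivalent form: length of the intersection divided by the focal length
pct_clip <- (pmin(focal_end, end) - pmax(focal_start, start)) / focal_len

stopifnot(isTRUE(all.equal(pct_loop, pct_clip)))
pct_clip
# [1] 0.4 0.4 0.2 0.9
```

The two forms only agree when the candidate interval actually overlaps the focal one, which is what the subsetting step in the loop guarantees.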


星軌x 2025-02-01 02:05:00

You could use foverlaps:

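# foverlaps() needs the second table keyed on its interval columns
setkey(dt, start_time, end_time)
# Self overlap join; columns coming from the first dt get the "i." prefix
foverlaps(dt, dt)[
  , window_overlap_pct := as.numeric(difftime(pmin(end_time, i.end_time), pmax(start_time, i.start_time), units = "hours")) /
                          as.numeric(difftime(end_time, start_time, units = "hours"))][
  # < 1 drops self-matches (and any interval fully covered by another)
  window_overlap_pct > .8 & window_overlap_pct < 1][
  , unique(.SD[sample(.N, 2, replace = TRUE)]), by = .(start_time, end_time)]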

                start_time            end_time    id  i.id        i.start_time          i.end_time window_overlap_pct
                    <POSc>              <POSc> <int> <int>              <POSc>              <POSc>              <num>
    1: 2000-01-04 04:07:07 2000-01-07 23:40:57   444    21 2000-01-01 18:11:32 2000-01-07 18:07:19          0.9392702
    2: 2000-01-04 04:07:07 2000-01-07 23:40:57   444  9856 2000-01-01 19:17:25 2000-01-07 11:32:50          0.8674663
    3: 2000-01-01 18:11:32 2000-01-07 18:07:19    21  9856 2000-01-01 19:17:25 2000-01-07 11:32:50          0.9466909
    4: 2000-01-01 18:11:32 2000-01-07 18:07:19    21  8668 2000-01-02 13:23:45 2000-01-13 14:45:43          0.8665766
    5: 2000-01-05 05:01:53 2000-01-07 14:43:53  3956   939 2000-01-05 16:25:01 2000-01-08 08:51:51          0.8026778
   ---                                                                                                               
13706: 2009-12-31 22:48:04 2010-01-09 15:03:55  8013  2751 2009-12-31 12:21:27 2010-01-08 22:16:44          0.9193986
13707: 2009-12-31 22:48:04 2010-01-09 15:03:55  8013  7056 2009-12-31 23:26:28 2010-01-10 20:01:14          0.9969279
13708: 2009-12-28 05:46:18 2010-01-08 19:13:54  4027  5221 2009-12-30 12:17:44 2010-01-09 05:40:29          0.8034898
13709: 2009-12-31 23:26:28 2010-01-10 20:01:14  7056  5221 2009-12-30 12:17:44 2010-01-09 05:40:29          0.8379163
13710: 2009-12-31 23:26:28 2010-01-10 20:01:14  7056  2751 2009-12-31 12:21:27 2010-01-08 22:16:44          0.8066545

This ran in about 3 minutes for 100,000 rows, but I could not scale it to 1 million rows because of a memory allocation error.
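
One way the memory problem might be reduced is to run the overlap join in chunks of focal rows, keeping at most two sampled matches per focal row before moving on, so the full self-join never has to sit in memory at once. The following is a minimal sketch under stated assumptions, not a tested solution at 1.5 million rows: `pick_options`, `chunk_size`, and `min_overlap` are hypothetical names introduced here, the data is a small synthetic table shaped like the question's, and self-matches are excluded by `id` rather than by `window_overlap_pct < 1`. Here the focal rows are passed as the first argument to `foverlaps`, so their columns carry the `i.` prefix.

```r
library(data.table)

# Small synthetic table with the same columns as the question's data
set.seed(123)
n <- 2000
start_times <- as.POSIXct("2000-01-01", tz = "UTC") + runif(n, 0, 3650 * 86400)
dt <- data.table(id = 1:n,
                 start_time = start_times,
                 end_time = start_times + runif(n) * 1e6)
setkey(dt, start_time, end_time)

chunk_size <- 500  # hypothetical value; tune to the available memory

# Hypothetical helper: overlap-join one chunk of focal rows against the full
# table and keep up to two high-overlap options per focal row
pick_options <- function(focal_chunk, all_rows, min_overlap = 0.8) {
  ov <- foverlaps(focal_chunk, all_rows,
                  by.x = c("start_time", "end_time"), type = "any")
  # Intersection length divided by the focal ("i.") interval length
  ov[, overlap := as.numeric(difftime(pmin(end_time, i.end_time),
                                      pmax(start_time, i.start_time),
                                      units = "hours")) /
                  as.numeric(difftime(i.end_time, i.start_time, units = "hours"))]
  res <- ov[id != i.id & overlap > min_overlap,
            .(focal_id = i.id, option_id = id, overlap)]
  # Sample without replacement, so no unique() is needed afterwards
  res[, .SD[sample(.N, min(.N, 2L))], by = focal_id]
}

# Split row indices into consecutive chunks and combine the per-chunk results
chunks <- split(seq_len(nrow(dt)), ceiling(seq_len(nrow(dt)) / chunk_size))
choice_list <- rbindlist(lapply(chunks, function(rows) pick_options(dt[rows], dt)))
```

Peak memory then depends on how many overlapping pairs a single chunk produces rather than on the whole table, at the cost of calling `foverlaps` once per chunk.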
