Efficiently calculating time interval overlap with data
I have a large data.table with ~1.5 million rows of start and end POSIXct times. For each of these rows I want to calculate the percentage of its start-end interval that overlaps with each start-end interval in the rest of the data, and save a subset of the high-overlap rows to a separate table. This is straightforward to do with a for loop, but my current approach takes a very long time to run. In the code below, looping through just the first 1,000 rows takes ~20 seconds on my desktop; it seems the main subsetting step (i.e., temp <- dt[end_time > focal$start_time & start_time < focal$end_time]) is the bulk of the processing. Is there another approach that would be faster?
library(data.table)

set.seed(123)
n <- 1500000

# Simulate n intervals: random start times over 2000-2010, with random
# durations of up to 1e6 seconds (~11.6 days).
start_times <- as.POSIXct(runif(n, as.POSIXct("2000-01-01"), as.POSIXct("2010-01-01")), origin = "1970-01-01")
end_times <- start_times + as.double(runif(n) * 1000000)
dt <- data.table(
  id = 1:n,
  start_time = start_times,
  end_time = end_times)
dt[, window_length_hours := as.double(end_time - start_time, units = "hours")]

start <- Sys.time()
choice_list <- data.table(focal_id = integer(), option_id = integer(), overlap = double())
for (i in 1:1000) {
  focal <- dt[i]
  # Subset to rows whose interval overlaps the focal interval; this is the
  # slow step.
  temp <- dt[end_time > focal$start_time & start_time < focal$end_time]
  # Fraction of the focal window covered by each overlapping interval.
  temp[, window_overlap_pct := (focal$window_length_hours -
    pmax(0, as.double(focal$end_time - end_time, units = "hours")) -
    pmax(0, as.double(start_time - focal$start_time, units = "hours"))) /
    focal$window_length_hours]
  # Save two rows that have high overlap, plus the focal row itself.
  sample <- unique(temp[window_overlap_pct > 0.80][sample(.N, 2, replace = TRUE)])
  choice_list <- rbindlist(list(choice_list,
    list(focal_id = focal$id, option_id = sample$id, overlap = sample$window_overlap_pct),
    list(focal_id = focal$id, option_id = focal$id, overlap = NA)))
}
end <- Sys.time()
end - start
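
For reference, the long expression computing window_overlap_pct above is algebraically equivalent to the standard interval-intersection formula, length(intersection) / length(focal interval). A minimal standalone version (overlap_pct is a hypothetical helper written for illustration, not part of the original post):

# Hypothetical helper: fraction of the focal interval [s1, e1] covered by
# another interval [s2, e2], for POSIXct inputs.
overlap_pct <- function(s1, e1, s2, e2) {
  # Length of the intersection, clamped at zero for disjoint intervals.
  inter <- pmax(0, as.double(pmin(e1, e2) - pmax(s1, s2), units = "secs"))
  inter / as.double(e1 - s1, units = "secs")
}

# e.g. overlap_pct(dt$start_time[1], dt$end_time[1], dt$start_time, dt$end_time)
# reproduces window_overlap_pct with row 1 as the focal interval.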
Comments (1)
You could use foverlaps: it worked for 100,000 rows in about 3 minutes, but wasn't able to scale to 1 million rows due to a memory allocation error.
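
For what it's worth, here is a sketch of what the foverlaps() approach could look like, assuming the dt built in the question (columns id, start_time, end_time, window_length_hours). The chunking, the chunk_size value, and keeping every high-overlap pair (rather than sampling two per focal row) are assumptions added here to work around the memory issue, not part of the answer:

library(data.table)

# foverlaps() requires the lookup table to be keyed on its interval columns.
setkey(dt, start_time, end_time)

chunk_size <- 10000  # illustrative value; tune to available memory
results <- vector("list", ceiling(nrow(dt) / chunk_size))
for (k in seq_along(results)) {
  rows <- ((k - 1) * chunk_size + 1):min(k * chunk_size, nrow(dt))
  # Overlap join of one chunk of focal rows against the full table.
  ov <- foverlaps(dt[rows], dt,
                  by.x = c("start_time", "end_time"),
                  type = "any")
  # Columns from the first argument come back prefixed with "i."; compute
  # the fraction of the focal (i.) window covered by each match.
  ov[, window_overlap_pct := as.double(
    pmin(end_time, i.end_time) - pmax(start_time, i.start_time),
    units = "hours") / i.window_length_hours]
  results[[k]] <- ov[window_overlap_pct > 0.80,
                     .(focal_id = i.id, option_id = id, overlap = window_overlap_pct)]
}
choice_list <- rbindlist(results)

Chunking the focal rows keeps each pairwise result small instead of materializing all candidate pairs at once, which is presumably what triggered the memory allocation error at 1 million rows. Note that type = "any" treats the intervals as closed, slightly broader than the question's strict inequalities, but pairs that merely touch have zero overlap and are dropped by the 0.80 filter anyway.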