Efficiently counting and categorizing events that occur within a specified interval
I am trying to analyze a large data set (>1.5M observations) to look for correlations between the timing and location of events. Unfortunately, I am encountering performance issues in creating my analysis data set.
Essentially, an event can occur at one of several facilities (A, B, and C), at one of several sites within each facility (1-6), and on a date in a given range (between January 1 and June 1, 1990), and it has an outcome value of 1 or 0. Sites are near one another, so an event at site 2 is likely to affect sites 1, 2, and 3. Events have a persistent effect as well, so an event on March 10 might affect future events through March 17 or so.
Here is some example data:
set.seed(12345)
df <- data.frame(
  facility = sample(
    c("A", "B", "C"),
    100,
    replace = TRUE),
  site = sample(
    1:6,
    100,
    replace = TRUE),
  date = as.Date(
    sample(
      c(lubridate::ymd("1990-1-1"):lubridate::ymd("1990-6-1")),
      100,
      replace = TRUE),
    origin = "1970-01-01"
  ),
  outcome = sample(
    c(0, 1), 100,
    replace = TRUE),
  stringsAsFactors = FALSE
)
So far, I have managed to get something that works by iterating over each row:
# The loop below uses dplyr and lubridate
library(dplyr)
library(lubridate)

# A place to put the output
outputdf <- data.frame(
  facility = character(),
  site = numeric(),
  date = as.Date(character()),
  outcome = numeric(),
  recent_success = integer(),
  recent_failures = integer(),
  stringsAsFactors = FALSE
)
# Loop through each event
for (i in 1:nrow(df)) {
  # Let me know how things are going in the console
  print(paste("Event ", i, " of ", nrow(df), sep = ""))
  # Choose just one event at a time
  EventofInterest <- df[i, ]
  # Get the facility, site, and date for that event
  facility_of_interest <- EventofInterest$facility
  site_of_interest <- EventofInterest$site
  date_of_interest <- EventofInterest$date
  # Count up recent successes: same facility, site within +/- 1,
  # date within the seven days up to and including this event's date
  recent_success <- df %>%
    filter(outcome == 1,
           facility == facility_of_interest,
           site %in% c(site_of_interest - 1, site_of_interest, site_of_interest + 1),
           date %within% interval(date_of_interest - 7, date_of_interest)) %>%
    nrow()
  # Count up recent failures the same way
  recent_failures <- df %>%
    filter(outcome == 0,
           facility == facility_of_interest,
           site %in% c(site_of_interest - 1, site_of_interest, site_of_interest + 1),
           date %within% interval(date_of_interest - 7, date_of_interest)) %>%
    nrow()
  # Attach the tallied successes and failures to the event
  outputdf <- EventofInterest %>%
    mutate(recent_success = recent_success,
           recent_failures = recent_failures) %>%
    # Bind that to the existing output dataframe
    bind_rows(outputdf)
}
And it even gives me the output I'm looking for:
> head(outputdf)
  facility site       date outcome recent_success recent_failures
1        C    4 1990-01-23       1             15              23
2        B    1 1990-02-18       1             16              19
3        B    1 1990-02-01       1             16              19
4        A    5 1990-01-06       1             10              17
5        B    5 1990-01-10       0             16              19
6        C    3 1990-02-26       1             15              23
But it gets unreasonably slow as my input dataframe gets larger (and more complex). The input data is about 150 MB.
So, my question is: how do I speed this process up? It seems like a good fit for something like dplyr::summarize(), or even for pulling in more processors (although I worry about RAM utilization). A for loop is almost certainly one of the slower ways to go about this.
I've already tried some things based on other posts, like making sure that as little of the calculation as possible is done within the for loop. I created date intervals in the input dataframe before starting, but that just seems to make the input bigger. I've also tried splitting out my successes and failures, but that just seems to hog more memory without speeding things up (evidently my choke point is not in comparing two numbers).
Any input would be greatly appreciated!
1 Answer
I'm not quite sure I've fully understood your question, but it sounds like you want to count based on a non-equi join, where facility matches exactly, site is within +/- 1, and date is within the past week. data.table and sqldf handle non-equi joins; dplyr does not. Nonetheless, I think we can get a pretty performant vectorized dplyr answer by making copies of the data (to simulate adjacent sites) and using slider::slide_index_dbl for the sliding time window.
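A minimal sketch of that idea, assuming the purrr and slider packages are available, and a trailing seven-day window that includes the event itself (matching the loop above); this is unbenchmarked and only reasoned through against the 100-row sample:

library(dplyr)
library(purrr)
library(slider)

# Copy each event to its own site and the two adjacent sites, so that
# grouping by (facility, site) later gathers every event within +/- 1 site.
expanded <- map_dfr(-1:1, function(offset) {
  df %>% mutate(orig_site = site, site = site + offset)
})

counts <- expanded %>%
  arrange(date) %>%        # slide_index_*() needs an ascending index
  group_by(facility, site) %>%
  mutate(
    recent_success  = slide_index_dbl(outcome, date, ~ sum(.x == 1), .before = 7),
    recent_failures = slide_index_dbl(outcome, date, ~ sum(.x == 0), .before = 7)
  ) %>%
  ungroup()

# Keep only the unshifted copies: those are the original events, now carrying
# counts over the neighboring sites and the trailing seven-day window.
result <- counts %>%
  filter(site == orig_site) %>%
  select(-orig_site)

And the same counts via a data.table non-equi join, again as a sketch rather than a benchmarked implementation:

library(data.table)
dt <- as.data.table(df)

# Each event defines a query: same facility, site within +/- 1, past 7 days.
query <- dt[, .(facility,
                site_lo = site - 1L, site_hi = site + 1L,
                date_lo = date - 7,  date_hi = date)]

# Non-equi join: for each query row, count matching events in dt.
counts_dt <- dt[query,
                on = .(facility, site >= site_lo, site <= site_hi,
                       date >= date_lo, date <= date_hi),
                .(recent_success  = sum(outcome == 1),
                  recent_failures = sum(outcome == 0)),
                by = .EACHI]   # one result row per query row, in query order

result_dt <- cbind(df, counts_dt[, .(recent_success, recent_failures)])

Both versions avoid the two big costs in the original loop: scanning the full table once per event, and growing outputdf one row at a time with bind_rows(), which is quadratic in the number of rows.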