How to efficiently aggregate sparse data
I have a large dataset with 1008412 observations; the columns are customer_id (int), visit_date (Date, format: "2010-04-04"), and visit_spend (float).

This date function for the aggregate maps the week numbers of interest to the range 13-65:

weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}
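As a quick sanity check of the mapping, here is a small sketch (the example dates are assumptions, evaluated under R's `%W` week-numbering convention, which counts weeks starting on Monday):

```r
# Same function as above, repeated so this snippet is self-contained
weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) +
    52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}

weekofperiod("2010-04-04")  # first week of the period -> 13
weekofperiod("2011-03-31")  # last week of the period  -> 65
```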
Each customer_id has a variable number of total visits over the 53-week period. For each customer_id, I want to get the aggregate spend_per_week, by weekofperiod(). The code below is functionally correct but very slow - any comments on how to make it faster? Also, aggregate() produces sparse output in which weeks without visits are missing; I initialize spend_per_week to 0 and then row-wise manually assign the non-zero results from aggregate(), to make sure the result always has 53 rows. Surely that can be done better?
Sample dataset lines look like:
customer_id visit_date visit_spend
72 40 2011-03-15 18.38
73 40 2011-03-20 23.45
74 79 2010-04-07 150.87
75 79 2010-04-17 101.90
76 79 2010-05-02 111.90
and here's the code with the aggregate() call and the adjustment for empty weeks:
for (cid in all_tt_cids) {
  print_pnq('Getting statistics for cid', cid)
  # Get row indices of the selected subset, for just this cid's records
  I <- which(tt$customer_id == cid & tt$visit_date <= "2011-03-31")
  # (other code to compute other per-cid statistics)
  # spend_per_week (mode;mean;sd)
  # Aggregate spend_per_week, but beware this should be 0 for those weeks with no visits
  spend_per_week <- data.frame(c(list('weekofperiod' = 13:65), list('spendperweek' = 0)))
  nonzero_spends_per_week <- aggregate(tt$visit_spend[I],
                                       list('weekofperiod' = weekofperiod(tt$visit_date[I])),
                                       FUN = "sum")
  for (i in 1:nrow(nonzero_spends_per_week)) {
    spend_per_week[spend_per_week$weekofperiod == nonzero_spends_per_week[i, 1], 2] <- nonzero_spends_per_week[i, 2]
  }
  colnames(spend_per_week)[2] <- 'spend_per_week'
  # (code to compute and store per-cid statistics on spend_per_week)
}
2 Answers
Your biggest speed up is going to come if you replace the for loops. I can't quite tell from your example, because you overwrite each customer in the loop, but here's one way to do it if you want to keep the info for all subjects. For testing, first define functions for the original method, and a new method without loops:
Now simulate a larger dataset so it's easier to compare:
Finally, compare the two methods:
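The answer's original code blocks were not preserved; the following self-contained sketch reconstructs the three steps. The function names (OrigMethod, NewMethod), the tapply-based rewrite, and the simulated sizes are assumptions:

```r
# Sketch: the answer's original code was not preserved; names and the
# tapply-based rewrite below are assumptions.
weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) +
    52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}

# Step 1: the original looped method vs. a loop-free method
OrigMethod <- function(tt) {
  res <- list()
  for (cid in unique(tt$customer_id)) {
    I <- which(tt$customer_id == cid)
    spend_per_week <- data.frame(weekofperiod = 13:65, spend_per_week = 0)
    agg <- aggregate(tt$visit_spend[I],
                     list(weekofperiod = weekofperiod(tt$visit_date[I])),
                     FUN = sum)
    for (i in 1:nrow(agg)) {
      spend_per_week[spend_per_week$weekofperiod == agg[i, 1], 2] <- agg[i, 2]
    }
    res[[as.character(cid)]] <- spend_per_week
  }
  res
}

NewMethod <- function(tt) {
  # tapply over customer x week factors; weeks with no visits come out NA -> 0
  wk <- factor(weekofperiod(tt$visit_date), levels = 13:65)
  m  <- tapply(tt$visit_spend, list(tt$customer_id, wk), sum)
  m[is.na(m)] <- 0
  m  # one row per customer_id, one column per week 13..65
}

# Step 2: simulate a larger dataset
set.seed(1)
n  <- 20000
tt <- data.frame(
  customer_id = sample(1:200, n, replace = TRUE),
  visit_date  = as.Date("2010-04-01") + sample(0:360, n, replace = TRUE),
  visit_spend = round(runif(n, 1, 200), 2)
)

# Step 3: compare results and timings
t_orig <- system.time(orig <- OrigMethod(tt))
t_new  <- system.time(fast <- NewMethod(tt))
stopifnot(all.equal(unname(fast["1", ]), orig[["1"]]$spend_per_week))
```

The vectorized version avoids both the per-customer loop and the row-wise assignment, which is where the original spends most of its time.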
This is already 50x faster, and I bet you can make it even better with more optimization. Good luck!
Here is a faster method using data.table, which is also easier to read. We can benchmark this against FastMethod and OrigMethod using rbenchmark, and see that we gain a 1.3x speedup over FastMethod and an overall speedup of 70x. You can speed it up even further (2.5x compared to FastMethod) if you do not care about reshaping the final output to customer id vs. week of period.
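The answer's data.table code was not preserved; below is a minimal self-contained sketch of the approach. The variable names, the simulated dataset, and the dcast-based reshape are assumptions:

```r
# Sketch: reconstructs the data.table approach described above; the names
# and simulated data are assumptions.
library(data.table)

weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) +
    52 * (as.numeric(format(as.Date(dt), "%Y")) - 2010)
}

# simulate data shaped like the question's
set.seed(1)
n  <- 20000
tt <- data.table(
  customer_id = sample(1:200, n, replace = TRUE),
  visit_date  = as.Date("2010-04-01") + sample(0:360, n, replace = TRUE),
  visit_spend = round(runif(n, 1, 200), 2)
)

# one grouped aggregation replaces the per-customer loop entirely
tt[, wk := factor(weekofperiod(visit_date), levels = 13:65)]
long <- tt[, .(spend_per_week = sum(visit_spend)), by = .(customer_id, wk)]

# optional reshape to customer_id x weekofperiod; drop = FALSE plus fill = 0
# restores the weeks with no visits, so every customer gets all 53 columns
wide <- dcast(long, customer_id ~ wk,
              value.var = "spend_per_week", fill = 0, drop = FALSE)
```

Skipping the final dcast is what gives the extra speedup the answer mentions: the grouped sum in `long` already contains all the non-zero aggregates.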