How can I make computing/inserting a date-difference column faster?
Can you make this R code faster? Can't see how to vectorize it.
I have a data-frame as follows (sample rows below):
> str(tt)
'data.frame': 1008142 obs. of 4 variables:
$ customer_id: int, visit_date : Date, format: "2010-04-04", ...
I want to compute the diff between visit_dates for each customer.
So I do diff(tt$visit_date), but have to enforce a discontinuity (NA) everywhere customer_id changes and the diff is meaningless, e.g. row 74 below.
The code at bottom does this, but takes >15 min on the 1M-row dataset.
I also tried computing piecewise and cbind'ing the subresult per customer_id (using which()); that was also slow.
Any suggestions? Thanks. I did search SO, R-intro, R manpages, etc.
customer_id visit_date visit_spend ivi
72 40 2011-03-15 18.38 5
73 40 2011-03-20 23.45 5
74 79 2010-04-07 150.87 NA
75 79 2010-04-17 101.90 10
76 79 2010-05-02 111.90 15
Code:
all_tt_cids <- unique(tt$customer_id)
# Append ivi (inter-visit interval) column
tt$ivi <- c(NA, diff(tt$visit_date))
for (cid in all_tt_cids) {
  # ivi has a discontinuity when customer_id changes
  tt$ivi[min(which(tt$customer_id == cid))] <- NA
}
(Wondering if we can create a logical index where customer_id differs to the row above?)
Comments (1)
To set NA in the appropriate places, you can again use diff() and a one-line trick.
Explanation:
Let's take some vector x.
We want to extract the indexes at which a new number starts, i.e. (0,5,8,11,12). We can use diff() for that, and take those indexes where the result is not equal to zero.
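The steps above can be sketched roughly as follows. The vector x here is a hypothetical example, chosen only so that its run starts match the indexes (0,5,8,11,12) quoted above. The small tt data frame is likewise a made-up stand-in for the question's data, assumed to be sorted by customer_id and then visit_date.

```r
# Hypothetical example vector (an assumption, not the answer's original x).
x <- c(1, 1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 6)

# diff() is non-zero exactly where an element differs from the one before it.
d <- diff(x)

# Positions just before each new run; 0 stands in for the start of the vector.
starts0 <- c(0, which(d != 0))
print(starts0)  # 0 5 8 11 12

# Adding 1 gives the 1-based index of the first element of each run.
run_starts <- starts0 + 1

# Same idea on a small made-up stand-in for tt
# (assumed sorted by customer_id, then visit_date).
tt <- data.frame(
  customer_id = c(40L, 40L, 40L, 79L, 79L, 79L),
  visit_date  = as.Date(c("2011-03-10", "2011-03-15", "2011-03-20",
                          "2010-04-07", "2010-04-17", "2010-05-02"))
)

# Inter-visit interval in days; the first row has no predecessor.
tt$ivi <- c(NA, as.numeric(diff(tt$visit_date)))

# One vectorized assignment replaces the per-customer loop:
# blank out the first row of every customer_id run.
tt$ivi[c(1, which(diff(tt$customer_id) != 0) + 1)] <- NA

print(tt)
```

Because diff() and which() each make a single pass over the column, this avoids rescanning the whole customer_id vector once per customer, which is what the loop in the question does.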