How to make computing/inserting a date-difference column faster?

Posted 2024-12-13 03:32:38

Can you make this R code faster? Can't see how to vectorize it.
I have a data-frame as follows (sample rows below):

> str(tt)
'data.frame':   1008142 obs. of  4 variables:
 $ customer_id: int, visit_date : Date, format: "2010-04-04", ...

I want to compute the diff between visit_dates for a customer.
So I do diff(tt$visit_date), but have to enforce a discontinuity (NA) everywhere customer_id changes and the diff is meaningless, e.g. row 74 below.
The code at bottom does this, but takes >15 min on the 1M row dataset.
I also tried piecewise computing and cbind'ing the subresult per customer_id (using which()), that was also slow.
Any suggestions? Thanks. I did search SO, R-intro, R manpages, etc.

   customer_id visit_date visit_spend ivi
72          40 2011-03-15       18.38   5
73          40 2011-03-20       23.45   5
74          79 2010-04-07      150.87  NA
75          79 2010-04-17      101.90  10
76          79 2010-05-02      111.90  15

Code:

all_tt_cids <- unique(tt$customer_id)

# Append ivi (Intervisit interval) column
tt$ivi <- c(NA,diff(tt$visit_date))
for (cid in all_tt_cids) {
  # ivi has a discontinuity when customer_id changes
  tt$ivi[min(which(tt$customer_id==cid))] <- NA
}

(Wondering if we can create a logical index where customer_id differs to the row above?)


羁绊已千年 2024-12-20 03:32:38

To set NA in the appropriate places, you can use diff() again with a one-line trick (this assumes customer_id is numeric and the rows are already grouped by customer):

> tt$ivi[c(1,diff(tt$customer_id)) != 0] <- NA

explanation

let's take some vector x

x <- c(1,1,1,1,2,2,2,4,4,4,5,3,3,3)

we want to extract the indexes at which a new number starts, i.e. (1,5,8,11,12), since R indexes from 1. We can use diff() for that.

y <- c(1,diff(x))
# y = 1  0  0  0  1  0  0  2  0  0  1 -2  0  0
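
a quick check of that, using the same x as above: which() lists the positions where y is nonzero, i.e. where a new run of values starts. The name `starts` is only for illustration.

```r
x <- c(1,1,1,1,2,2,2,4,4,4,5,3,3,3)
y <- c(1, diff(x))

# positions where the value differs from the one before it
# (position 1 always counts, because of the leading 1 in y)
starts <- which(y != 0)
print(starts)
# 1 5 8 11 12
```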

and take those indexes, that are not equal to zero:

x[y!=0] <- NA
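
a minimal sketch of the whole thing on a toy data frame, with values copied from the sample rows in the question. It assumes tt is already sorted by customer_id and then by visit_date, and that customer_id is numeric. The second variant (the names `new_customer` and `ivi2` are made up for this example) compares each id with the one in the row above, which should also work when the ids are not numeric.

```r
# toy data copied from the sample rows in the question
tt <- data.frame(
  customer_id = c(40L, 40L, 79L, 79L, 79L),
  visit_date  = as.Date(c("2011-03-15", "2011-03-20",
                          "2010-04-07", "2010-04-17", "2010-05-02")),
  visit_spend = c(18.38, 23.45, 150.87, 101.90, 111.90)
)

# day differences between consecutive rows, NA for the first row
tt$ivi <- c(NA, as.numeric(diff(tt$visit_date)))

# blank out the rows where a new customer_id starts
tt$ivi[c(1, diff(tt$customer_id)) != 0] <- NA

# alternative: logical index, TRUE where the id differs from the row above
n <- nrow(tt)
new_customer <- c(TRUE, tt$customer_id[-1] != tt$customer_id[-n])
ivi2 <- c(NA, as.numeric(diff(tt$visit_date)))
ivi2[new_customer] <- NA

print(tt)
```

both variants are single vectorized passes over the column, so there is no per-customer loop or repeated which() scan.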
