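Help me replace a for loop with an "apply" function
...if that is possible
My task is to find the longest streak of consecutive days on which a user participated in a game.
Instead of writing an SQL function, I chose to use R's rle function to get the longest streaks and then update my database table with the results.
The (attached) data frame looks something like this: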
day user_id
2008/11/01 2001
2008/11/01 2002
2008/11/01 2003
2008/11/01 2004
2008/11/01 2005
2008/11/02 2001
2008/11/02 2005
2008/11/03 2001
2008/11/03 2003
2008/11/03 2004
2008/11/03 2005
2008/11/04 2001
2008/11/04 2003
2008/11/04 2004
2008/11/04 2005
I tried the following to get the longest streak per user:
# turn it to a contingency table
my_table <- table(user_id, day)
# get the streaks
rle_table <- apply(my_table,1,rle)
# verify the longest streak of "1"s for user 2001
# as.vector(tapply(rle_table$'2001'$lengths, rle_table$'2001'$values, max)["1"])
# loop to get the results
# initiate results matrix
res<-matrix(nrow=dim(my_table)[1], ncol=2)
for (i in 1:dim(my_table)[1]) {
string <- paste("as.vector(tapply(rle_table$'", rownames(my_table)[i], "'$lengths, rle_table$'", rownames(my_table)[i], "'$values, max)['1'])", sep="")
res[i,]<-c(as.integer(rownames(my_table)[i]) , eval(parse(text=string)))
}
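Unfortunately this for loop takes too long, and I'm wondering if there is a way to produce the res matrix using a function from the "apply" family.
Thank you in advance.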
The apply functions are not always (or even generally) faster than a for loop. That is a remnant of R's association with S-Plus (in the latter, apply is faster than for). One exception is lapply, which is frequently faster than for (because it uses C code). See this related question.
So you should use apply primarily to improve the clarity of code, not to improve performance.
You might find Dirk's presentation on high-performance computing useful. One other brute-force approach is "just-in-time compilation" with Ra instead of the normal R version, which is optimized to handle for loops.
[Edit:] There are clearly many ways to achieve this, and this is by no means better even if it's more compact. Just working with your code, here's another approach:
You would probably need to manipulate the output a little further.
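The code that accompanied this answer is not preserved here. As a minimal sketch (not the original answer's code), the question's loop could be replaced with a single apply() call over the rows of the contingency table. This also avoids the paste/parse/eval step, which is probably the real cost in the loop. The games data frame and the longest_streak helper are names assumed here for illustration.

```r
# Sample data from the question (games is an assumed name)
games <- data.frame(
  day = rep(c("2008/11/01", "2008/11/02", "2008/11/03", "2008/11/04"),
            times = c(5, 2, 4, 4)),
  user_id = c(2001:2005, 2001, 2005, 2001, 2003, 2004, 2005,
              2001, 2003, 2004, 2005)
)

# Rows are users, columns are days, cells are participation counts
my_table <- table(games$user_id, games$day)

# Longest run of "played" days in one row (hypothetical helper)
longest_streak <- function(x) {
  r <- rle(as.vector(x) > 0)        # runs of played / not played
  played <- r$lengths[r$values]     # keep only the "played" runs
  if (length(played) == 0) 0L else max(played)
}

# One call per row; the result is named by user_id
streaks <- apply(my_table, 1, longest_streak)

# Two-column result in the same shape as the question's res matrix
res <- cbind(user_id = as.integer(names(streaks)), streak = streaks)
```

For the sample data this gives a streak of 4 for users 2001 and 2005, 2 for users 2003 and 2004, and 1 for user 2002. As in the question's approach, a day on which nobody played does not appear as a column in the table, so a streak would be counted across such a gap.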
EDIT: Fixed. I originally assumed that I would have to modify most of rle(), but it turns out only a few tweaks were needed.
This isn't an answer about an *apply method, but I wonder if this might not be a faster approach to the process overall. As Shane says, loops aren't so bad. And... I rarely get to show my code to anyone, so I'd be happy to hear some critique of this.
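The code for this answer is also missing here. For illustration only, and not as the author's version, the sketch below shows what a pared-down rle() might look like when all that is needed is the longest run of played days. It locates run ends the same way base rle() does but skips building the rle object. The name longest_run and the games data frame are assumptions.

```r
# Sample data from the question (games is an assumed name)
games <- data.frame(
  day = rep(c("2008/11/01", "2008/11/02", "2008/11/03", "2008/11/04"),
            times = c(5, 2, 4, 4)),
  user_id = c(2001:2005, 2001, 2005, 2001, 2003, 2004, 2005,
              2001, 2003, 2004, 2005)
)

# Longest run of TRUE values, using the run-end logic of rle()
longest_run <- function(x) {
  x <- as.vector(x) > 0
  n <- length(x)
  if (n == 0L || !any(x)) return(0L)
  # A run ends where the next value differs, or at the end of the vector
  ends <- c(which(x[-1L] != x[-n]), n)
  run_lengths <- diff(c(0L, ends))
  # x[ends] is the value of each run; keep only the TRUE runs
  max(run_lengths[x[ends]])
}

my_table <- table(games$user_id, games$day)
streaks <- apply(my_table, 1, longest_run)
```

Whether this is actually faster than calling rle() would need to be measured on the real data, for example with system.time().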
another option
If you've got a really long list of data, then it sounds like it may be a clustering problem. Each cluster would be defined by a user and a set of dates with a maximum separation of one day. Then retrieve the largest cluster for each user. I'll edit this if I think of a specific method.
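One way to read the clustering idea, sketched under the assumption that the day column can be parsed as a date: sort each user's distinct days, start a new cluster wherever the gap to the previous day is more than one, and take the size of the largest cluster. The names games and longest_by_gap are made up for this example.

```r
# Sample data from the question (games is an assumed name)
games <- data.frame(
  day = as.Date(rep(c("2008/11/01", "2008/11/02", "2008/11/03", "2008/11/04"),
                    times = c(5, 2, 4, 4)),
                format = "%Y/%m/%d"),
  user_id = c(2001:2005, 2001, 2005, 2001, 2003, 2004, 2005,
              2001, 2003, 2004, 2005)
)

# Size of the largest group of days separated by at most one day
longest_by_gap <- function(days) {
  d <- sort(unique(days))                   # days as numbers, in order
  # A new cluster starts at the first day and after every gap > 1
  cluster <- cumsum(c(TRUE, diff(d) > 1))
  max(tabulate(cluster))                    # size of the biggest cluster
}

# Split the numeric day values by user and apply the helper to each group
days_by_user <- split(as.numeric(games$day), games$user_id)
streaks <- vapply(days_by_user, longest_by_gap, integer(1))
```

Because this works on the calendar gaps directly, it never builds the user-by-day contingency table, and a day on which no user played correctly breaks a streak.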
This was Chris's suggestion for how to get the data:
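The suggested code itself did not survive in this copy. As a stand-in only (an assumption, not Chris's code), the sample rows from the question can be loaded with read.table() and the day column converted to Date; games is a name chosen for this sketch.

```r
# Read the sample rows from the question into a data frame
games <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
day user_id
2008/11/01 2001
2008/11/01 2002
2008/11/01 2003
2008/11/01 2004
2008/11/01 2005
2008/11/02 2001
2008/11/02 2005
2008/11/03 2001
2008/11/03 2003
2008/11/03 2004
2008/11/03 2005
2008/11/04 2001
2008/11/04 2003
2008/11/04 2004
2008/11/04 2005
")

# Parse the day strings so that date arithmetic works
games$day <- as.Date(games$day, format = "%Y/%m/%d")
```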