Help me replace a for loop with an apply function

Posted 2024-08-06 21:24:01

...if that is possible

My task is to find the longest streak of consecutive days a user participated in a game.

Instead of writing an SQL function, I chose to use R's rle function to get the longest streaks, and then update my database table with the results.

The (attached) data frame looks like this:

    day      user_id
2008/11/01    2001
2008/11/01    2002
2008/11/01    2003
2008/11/01    2004
2008/11/01    2005
2008/11/02    2001
2008/11/02    2005
2008/11/03    2001
2008/11/03    2003
2008/11/03    2004
2008/11/03    2005
2008/11/04    2001
2008/11/04    2003
2008/11/04    2004
2008/11/04    2005

I tried the following to get each user's longest streak:

# turn it to a contingency table
my_table <- table(user_id, day)

# get the streaks
rle_table <- apply(my_table,1,rle)

# verify the longest streak of "1"s for user 2001
# as.vector(tapply(rle_table$'2001'$lengths, rle_table$'2001'$values, max)["1"])

# loop to get the results
# initiate results matrix
res<-matrix(nrow=dim(my_table)[1], ncol=2)

for (i in 1:dim(my_table)[1]) {
string <- paste("as.vector(tapply(rle_table$'", rownames(my_table)[i], "'$lengths, rle_table$'", rownames(my_table)[i], "'$values, max)['1'])", sep="")
res[i,]<-c(as.integer(rownames(my_table)[i]) , eval(parse(text=string)))
}

Unfortunately this for loop takes too long, and I am wondering whether there is a way to produce the res matrix using a function from the apply family.
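One possible apply-family rewrite (a sketch, not guaranteed to be the fastest): since apply(my_table, 1, rle) already returns a list named by user_id, running sapply over that list avoids both the loop and the eval(parse(...)) string construction. The sample data is rebuilt inline so the snippet runs standalone:

```r
# rebuild the sample data so the snippet is self-contained
user_id <- c(2001:2005, 2001, 2005, 2001, 2003, 2004, 2005,
             2001, 2003, 2004, 2005)
day <- rep(c("2008/11/01", "2008/11/02", "2008/11/03", "2008/11/04"),
           times = c(5, 2, 4, 4))
my_table <- table(user_id, day)
rle_table <- apply(my_table, 1, rle)

# longest run of 1s per user, no eval(parse()) needed
streaks <- sapply(rle_table, function(r) max(r$lengths[r$values == 1]))
res <- cbind(user_id = as.integer(names(streaks)), streak = unname(streaks))
res
```

With the sample data this gives streaks of 4, 1, 2, 2 and 4 days for users 2001 through 2005.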

Thank you in advance



5 Answers

我不会写诗 2024-08-13 21:24:01


The apply functions are not always (or even generally) faster than a for loop. That is a remnant of R's association with S-Plus (in the latter, apply is faster than for). One exception is lapply, which is frequently faster than for (because it uses C code). See this related question.

So you should use apply primarily to improve the clarity of code, not to improve performance.

You might find Dirk's presentation on high-performance computing useful. One other brute force approach is "just-in-time compilation" with Ra instead of the normal R version, which is optimized to handle for loops.

[Edit:] There are clearly many ways to achieve this, and this is by no means better even if it's more compact. Just working with your code, here's another approach:

dt <- data.frame(table(dat))[,2:3]
dt.b <- by(dt[,2], dt[,1], rle)
# count only runs of days actually played (Freq == 1)
t(data.frame(lapply(dt.b, function(x) max(x$lengths[x$values == 1]))))

You would probably need to manipulate the output a little further.
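That further manipulation might look like this (a sketch; the inputs are rebuilt inline so it runs standalone, and only runs where Freq is 1 are counted):

```r
dat <- read.table(text = "day user_id
2008/11/01 2001
2008/11/01 2002
2008/11/01 2003
2008/11/01 2004
2008/11/01 2005
2008/11/02 2001
2008/11/02 2005
2008/11/03 2001
2008/11/03 2003
2008/11/03 2004
2008/11/03 2005
2008/11/04 2001
2008/11/04 2003
2008/11/04 2004
2008/11/04 2005", header = TRUE)

dt <- data.frame(table(dat))[, 2:3]
dt.b <- by(dt[, 2], dt[, 1], rle)
# two-column result: user_id and longest run of days with Freq == 1
streaks <- sapply(dt.b, function(x) max(x$lengths[x$values == 1]))
cbind(user_id = as.integer(names(streaks)), streak = unname(streaks))
```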

掩耳倾听 2024-08-13 21:24:01


EDIT: Fixed. I originally assumed that I would have to modify most of rle(), but it turns out only a few tweaks were needed.

This isn't an answer about an *apply method, but I wonder if this might not be a faster approach to the process overall. As Shane says, loops aren't so bad. And... I rarely get to show my code to anyone, so I'd be happy to hear some critique of this.

#Shane, I told you this was awesome
dat <- getSOTable("http://stackoverflow.com/questions/1504832/help-me-replace-a-for-loop-with-an-apply-function", 1)
colnames(dat) <- c("day", "user_id")
#Convert to dates so that arithmetic works properly on them
dat$day <- as.Date(dat$day)

#Custom rle for dates
rle.date <- function (x)
{
    #Accept only dates
    if (class(x) != "Date")
        stop("'x' must be an object of class \"Date\"")
    n <- length(x)
    if (n == 0L)
        return(list(lengths = integer(0L), values = x))
    #Dates need to be sorted
    x.sort <- sort(x)
    #y is a vector indicating at which indices the date is not consecutive with its predecessor
    y <- x.sort[-1L] != (x.sort + 1)[-n]
    #i returns the indices of y that are TRUE, and appends the index of the last value
    i <- c(which(y | is.na(y)), n)
    #diff tells you the distances in between TRUE/non-consecutive dates. max gets the largest of these.
    max(diff(c(0L, i)))
}

#Loop
max.consec.use <- matrix(nrow = length(unique(dat$user_id)), ncol = 1)
rownames(max.consec.use) <- unique(dat$user_id)

for(i in 1:length(unique(dat$user_id))){
    user <- unique(dat$user_id)[i]
    uses <- subset(dat, user_id %in% user)
    max.consec.use[paste(user), 1] <- rle.date(uses$day)
}

max.consec.use
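For what it's worth, the explicit loop above can also be collapsed into a single tapply call, since the function only needs each user's vector of dates. A self-contained sketch with an inlined, simplified equivalent of rle.date (longest_run is a hypothetical name, and the data here is a small made-up subset):

```r
dat <- data.frame(
  day = as.Date(c("2008/11/01", "2008/11/02", "2008/11/03", "2008/11/04",
                  "2008/11/01", "2008/11/03", "2008/11/04"),
                format = "%Y/%m/%d"),
  user_id = c(2001, 2001, 2001, 2001, 2003, 2003, 2003)
)

# simplified rle.date: longest run of consecutive calendar days
longest_run <- function(d) {
  d <- sort(d)
  breaks <- which(diff(d) != 1)          # indices where the streak breaks
  max(diff(c(0L, breaks, length(d))))    # longest stretch between breaks
}

tapply(dat$day, dat$user_id, longest_run)
```

tapply splits dat$day by user and applies the function to each subset, returning a named vector of streak lengths.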
欢烬 2024-08-13 21:24:01


Another option:

# convert to Date
day_table$day <- as.Date(day_table$day, format="%Y/%m/%d")
# split by user and then look for contiguous days
contig <- sapply(split(day_table$day, day_table$user_id), function(.days){
    .diff <- cumsum(c(TRUE, diff(.days) != 1))
    max(table(.diff))
})
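Run against the sample data from the question, this yields one streak length per user (a self-contained re-run; the expected values below were checked by hand):

```r
day_table <- read.table(text = "day user_id
2008/11/01 2001
2008/11/01 2002
2008/11/01 2003
2008/11/01 2004
2008/11/01 2005
2008/11/02 2001
2008/11/02 2005
2008/11/03 2001
2008/11/03 2003
2008/11/03 2004
2008/11/03 2005
2008/11/04 2001
2008/11/04 2003
2008/11/04 2004
2008/11/04 2005", header = TRUE)

# convert to Date
day_table$day <- as.Date(day_table$day, format = "%Y/%m/%d")
# split by user and then look for contiguous days
contig <- sapply(split(day_table$day, day_table$user_id), function(.days) {
  .diff <- cumsum(c(TRUE, diff(.days) != 1))
  max(table(.diff))
})
contig
# 2001 2002 2003 2004 2005
#    4    1    2    2    4
```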
最单纯的乌龟 2024-08-13 21:24:01


If you've got a really long list of data, then it sounds like maybe a clustering problem. Each cluster would be defined by a user and dates with a maximum separation distance of one. Then retrieve the largest cluster by user. I'll edit this if I think of a specific method.
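One way to make that concrete (a sketch, and not necessarily faster than the other answers): single-linkage hierarchical clustering on one user's dates, cut at height 1, so that days at distance 1 land in the same cluster; the streak is then the size of the largest cluster.

```r
# one user's participation dates, with a gap after Nov 1
d <- as.Date(c("2008/11/01", "2008/11/03", "2008/11/04"), format = "%Y/%m/%d")

# single-linkage clustering; cutting the tree at height 1 groups consecutive days
cl <- cutree(hclust(dist(as.numeric(d)), method = "single"), h = 1)
max(table(cl))   # longest streak: 2 (Nov 3-4)
```

This would be wrapped in a per-user split (as in the other answers) to cover the whole table.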

清风无影 2024-08-13 21:24:01


This was Chris's suggestion for how to get the data:

dat <- read.table(textConnection(
 "day      user_id
 2008/11/01    2001
 2008/11/01    2002
 2008/11/01    2003
 2008/11/01    2004
 2008/11/01    2005
 2008/11/02    2001
 2008/11/02    2005
 2008/11/03    2001
 2008/11/03    2003
 2008/11/03    2004
 2008/11/03    2005
 2008/11/04    2001
 2008/11/04    2003
 2008/11/04    2004
 2008/11/04    2005
 "), header=TRUE)