Speed up the loop operation in R

Posted 2024-09-02 22:39:13 · 772 characters · 14 views · 0 comments


I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to a data.frame and accumulates something. (simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10h now) and I have no idea about the runtime.

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

Any ideas how to speed up this operation?


Comments (10)

歌入人心 2024-09-09 22:39:13


Biggest problem and root of ineffectiveness is indexing data.frame, I mean all these lines where you use temp[,].
Try to avoid this as much as possible. I took your function, changed the indexing, and here is version_A

dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){
        res[i] <- i
        if (i > 1) {
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) {
                res[i] <- temp[i,9] + res[i-1]
            } else {
                res[i] <- temp[i,9]
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}

As you can see, I create a vector res which gathers the results. At the end I add it to the data.frame, so I don't need to mess with names.
So how much better is it?

I ran each function for a data.frame with nrow from 1,000 to 10,000 by 1,000 and measured the time with system.time:

X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))

Result is

(performance plot)

You can see that your version depends exponentially on nrow(X). The modified version has a linear relation, and a simple lm model predicts that for 850,000 rows the computation takes 6 minutes and 10 seconds.

Power of vectorization

As Shane and Calimo state in their answers, vectorization is a key to better performance.
From your code you could move outside of the loop:

  • conditioning
  • initialization of the results (which are temp[i,9])

This leads to this code

dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

Compare the results for these functions, this time for nrow from 10,000 to 100,000 by 10,000.

(performance plot)

Tuning the tuned

Another tweak is to change the in-loop indexing temp[i,9] to res[i] (which are exactly the same in the i-th loop iteration).
It's again the difference between indexing a vector and indexing a data.frame.
Second thing: when you look at the loop you can see that there is no need to loop over all i, but only over the ones that fit the condition.
So here we go

dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

The performance you gain depends highly on the data structure, precisely on the percentage of TRUE values in the condition.
For my simulated data the computation for 850,000 rows takes below one second.

(performance plot)

If you want you can go further; I see at least two things which can be done:

  • write C code to do the conditional cumsum
  • if you know that the max sequence in your data isn't large, then you can change the loop to a vectorized while, something like

    while (any(cond)) {
        indx <- c(FALSE, cond[-1] & !cond[-n])
        res[indx] <- res[indx] + res[which(indx)-1]
        cond[indx] <- FALSE
    }

Code used for simulations and figures is available on GitHub.
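
Following up on the conditional-cumsum suggestion above: the whole loop is a per-run cumulative sum, which base R can express with no loop at all. A sketch, assuming the same column layout as in the question (the sum resets whenever columns 6 and 3 change between consecutive rows):

```r
# Hypothetical fully vectorized variant: cumsum within runs of identical (col6, col3)
dayloop2_vec <- function(temp) {
  n <- nrow(temp)
  cond <- c(FALSE, (temp[-n, 6] == temp[-1, 6]) & (temp[-n, 3] == temp[-1, 3]))
  grp <- cumsum(!cond)                               # run id: bumps wherever the condition breaks
  temp$`Kumm.` <- ave(temp[, 9], grp, FUN = cumsum)  # cumulative sum inside each run
  temp
}
```

ave() applies cumsum group-wise and returns the values in the original row order, so the result should match dayloop2_D without any iteration in R.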

牵你的手,一向走下去 2024-09-09 22:39:13


General strategies for speeding up R code

First, figure out where the slow part really is. There's no need to optimize code that isn't running slowly. For small amounts of code, simply thinking through it can work. If that fails, RProf and similar profiling tools can be helpful.

Once you figure out the bottleneck, think about more efficient algorithms for doing what you want. Calculations should be only run once if possible, so:

Using more efficient functions can produce moderate or large speed gains. For instance, paste0 produces a small efficiency gain but .colSums() and its relatives produce somewhat more pronounced gains. mean is particularly slow.

Then you can avoid some particularly common troubles:

  • cbind will slow you down really quickly.
  • Initialize your data structures, then fill them in, rather than expanding them each time.
  • Even with pre-allocation, you could switch to a pass-by-reference approach rather than a pass-by-value approach, but it may not be worth the hassle.
  • Take a look at the R Inferno for more pitfalls to avoid.

Try for better vectorization, which can often but not always help. In this regard, inherently vectorized commands like ifelse, diff, and the like will provide more improvement than the apply family of commands (which provide little to no speed boost over a well-written loop).

You can also try to provide more information to R functions. For instance, use vapply rather than sapply, and specify colClasses when reading in text-based data. Speed gains will be variable depending on how much guessing you eliminate.
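
To illustrate the vapply and colClasses points with a toy example (the file name in the comment is just a placeholder):

```r
x <- list(a = 1:3, b = 4:6)
sapply(x, sum)                # result type is guessed at runtime
vapply(x, sum, numeric(1))    # result type declared up front: safer, and often faster

# likewise, declaring column types lets read.csv skip its type guessing:
# read.csv("data.csv", colClasses = c("integer", "numeric", "character"))
```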

Next, consider optimized packages: The data.table package can produce massive speed gains where its use is possible, in data manipulation and in reading large amounts of data (fread).

Next, try for speed gains through more efficient means of calling R:

  • Compile your R script. Or use the Ra and jit packages in concert for just-in-time compilation (Dirk has an example in this presentation).
  • Make sure you're using an optimized BLAS. These provide across-the-board speed gains. Honestly, it's a shame that R doesn't automatically use the most efficient library on install. Hopefully Revolution R will contribute the work that they've done here back to the overall community.
  • Radford Neal has done a bunch of optimizations, some of which were adopted into R Core, and many others which were forked off into pqR.

And lastly, if all of the above still doesn't get you quite as fast as you need, you may need to move to a faster language for the slow code snippet. The combination of Rcpp and inline here makes replacing only the slowest part of the algorithm with C++ code particularly easy. Here, for instance, is my first attempt at doing so, and it blows away even highly optimized R solutions.

If you're still left with troubles after all this, you just need more computing power. Look into parallelization (http://cran.r-project.org/web/views/HighPerformanceComputing.html) or even GPU-based solutions (gpu-tools).
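
A toy sketch of the parallel route using the base parallel package (mclapply forks processes and is not available on Windows, where parLapply with a cluster is the usual substitute; the chunking here is illustrative):

```r
library(parallel)
# split independent work into chunks and farm them out to two cores
chunks <- split(1:1e6, rep(1:4, each = 250000))
partial <- mclapply(chunks, function(ix) sum(sqrt(ix)), mc.cores = 2)
total <- Reduce(`+`, partial)
```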

Links to other guidance

呢古 2024-09-09 22:39:13


If you are using for loops, you are most likely coding R as if it was C or Java or something else. R code that is properly vectorised is extremely fast.

Take for example these two simple bits of code to generate a list of 100,000 integers in sequence:

The first code example is how one would code a loop using a traditional coding paradigm. It takes 28 seconds to complete

system.time({
    a <- NULL
    for(i in 1:1e5)a[i] <- i
})
   user  system elapsed 
  28.36    0.07   28.61 

You can get an almost 100-times improvement by the simple action of pre-allocating memory:

system.time({
    a <- rep(1, 1e5)
    for(i in 1:1e5)a[i] <- i
})

   user  system elapsed 
   0.30    0.00    0.29 

But using the base R vector operation using the colon operator : this operation is virtually instantaneous:

system.time(a <- 1:1e5)

   user  system elapsed 
      0       0       0 
顾北清歌寒 2024-09-09 22:39:13


This could be made much faster by skipping the loops by using indexes or nested ifelse() statements.

idx <- 1:nrow(temp)
temp[,10] <- idx
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10] 
temp[!idx1,10] <- temp[!idx1,9]    
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."
信愁 2024-09-09 22:39:13


As Ari mentioned at the end of his answer, the Rcpp and inline packages make it incredibly easy to make things fast. As an example, try this inline code (warning: not tested):

body <- 'Rcpp::NumericMatrix nm(temp);
         int nrtemp = Rcpp::as<int>(nrt);
         for (int i = 0; i < nrtemp; ++i) {
             nm(i, 9) = i;
             if (i > 0) {    // 0-based: row 0 is the first row
                 if ((nm(i, 5) == nm(i - 1, 5)) && (nm(i, 2) == nm(i - 1, 2))) {
                     nm(i, 9) = nm(i, 8) + nm(i - 1, 9);
                 } else {
                     nm(i, 9) = nm(i, 8);
                 }
             } else {
                 nm(i, 9) = nm(i, 8);
             }
         }
         return Rcpp::wrap(nm);
        '

settings <- getPlugin("Rcpp")
# settings$env$PKG_CXXFLAGS <- paste("-I", getwd(), sep="") if you want to inc files in wd
dayloop <- cxxfunction(signature(nrt="numeric", temp="numeric"), body=body,
    plugin="Rcpp", settings=settings, cppargs="-I/usr/include")

dayloop2 <- function(temp) {
    # pass the row count and the data; dayloop fills column 10 of the matrix copy
    nr <- nrow(temp)
    nm <- dayloop(nr, temp)
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

There's a similar procedure for #includeing things, where you just pass a parameter

inc <- '#include <header.h>'

to cxxfunction, as include=inc. What's really cool about this is that it does all of the linking and compilation for you, so prototyping is really fast.

Disclaimer: I'm not totally sure that the class of tmp should be numeric and not numeric matrix or something else. But I'm mostly sure.

Edit: if you still need more speed after this, OpenMP is a parallelization facility good for C++. I haven't tried using it from inline, but it should work. The idea would be to, in the case of n cores, have loop iteration k be carried out by k % n. A suitable introduction is found in Matloff's The Art of R Programming, available here, in chapter 16, Resorting to C.

半枫 2024-09-09 22:39:13


I dislike rewriting code... Also of course ifelse and lapply are better options but sometimes it is difficult to make that fit.

Frequently I use data.frames as one would use lists such as df$var[i]

Here is a made up example:

nrow=function(x){ ##required as I use nrow at times.
  if(class(x)=='list') {
    length(x[[names(x)[1]]])
  }else{
    base::nrow(x)
  }
}

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
})

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  d=as.list(d) #become a list
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  d=as.data.frame(d) #revert back to data.frame
})

data.frame version:

   user  system elapsed 
   0.53    0.00    0.53

list version:

   user  system elapsed 
   0.04    0.00    0.03 

It is 17 times faster to use a list of vectors than a data.frame.

Any comments on why internally data.frames are so slow in this regard? One would think they operate like lists...

For even faster code, set class(d)='list' instead of d=as.list(d), and restore it afterwards with class(d)='data.frame':

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  class(d)='list'
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  class(d)='data.frame'
})
head(d)
暮年 2024-09-09 22:39:13


The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10h now) and I have no idea about the runtime". I always put in the following code into loops when developing to get a feel for how changes seem to affect the speed and also for monitoring how long it will take to complete.

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    cat(round(i/nrow(temp)*100,2),"%    \r") # prints the percentage complete in realtime.
    # do stuff
  }
  return(blah)
}

Works with lapply as well.

dayloop2 <- function(temp){
  temp <- lapply(1:nrow(temp), function(i) {
    cat(round(i/nrow(temp)*100,2),"%    \r")
    #do stuff
  })
  return(temp)
}

If the function within the loop is quite fast but the number of loops is large then consider just printing every so often as printing to the console itself has an overhead. e.g.

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    if(i %% 100 == 0) cat(round(i/nrow(temp)*100,2),"%    \r") # prints every 100 times through the loop
    # do stuff
  }
  return(temp)
}
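
Base R's txtProgressBar (in the default-attached utils package) handles this formatting for you; a minimal sketch with a dummy workload standing in for the loop body:

```r
pb <- txtProgressBar(min = 0, max = 100, style = 3)
for (i in 1:100) {
  Sys.sleep(0.01)          # stand-in for the real per-iteration work
  setTxtProgressBar(pb, i)
}
close(pb)
```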
时光沙漏 2024-09-09 22:39:13


In R, you can often speed-up loop processing by using the apply family functions (in your case, it would probably be replicate). Have a look at the plyr package that provides progress bars.

Another option is to avoid loops altogether and replace them with vectorized arithmetics. I'm not sure exactly what you are doing, but you can probably apply your function to all rows at once:

temp[, 10] <- temp[, 9] + c(0, temp[-nrow(temp), 10])

This will be much, much faster, and then you can filter the rows with your condition (vectorized, rather than per-index):

cond <- c(FALSE, (temp[-1, 6] == temp[-nrow(temp), 6]) & (temp[-1, 3] == temp[-nrow(temp), 3]))
temp[cond, 10] <- temp[cond, 9]

Vectorized arithmetic requires more time spent thinking about the problem, but you can sometimes save several orders of magnitude in execution time.

尐偏执 2024-09-09 22:39:13


Take a look at the accumulate() function from {purrr}:

library(dplyr)    # provides %>%, mutate, lag, select
library(purrr)

dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
    mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
    mutate(V10 = V9 %>%
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if (.y) {
                 .i_1 + .i
               } else {
                 .i
               }
             }) %>% unlist()) %>%
    select(-cond)
}
却一份温柔 2024-09-09 22:39:13


Processing with data.table is a viable option:

n <- 1000000
df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
colnames(df) <- paste("col", 1:9, sep = "")

library(data.table)
library(microbenchmark)

dayloop2.dt <- function(df) {
  dt <- data.table(df)
  dt[, Kumm. := {
    res <- .I;
    ifelse (res > 1,             
      ifelse ((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)) , 
        res <- col9 + shift(res)                   
      , # else
        res <- col9                                 
      )
     , # else
      res <- col9
    )
  }
  ,]
  res <- data.frame(dt)
  return (res)
}

res <- dayloop2.dt(df)

m <- microbenchmark(dayloop2.dt(df), times = 100)
#Unit: milliseconds
#       expr      min        lq     mean   median       uq      max neval
#dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042    10

If you ignore the possible gains from conditions filtering, it is very fast. Obviously, if you can do the calculation on the subset of data, it helps.
