Speed up the loop operation in R
I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates something (a simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10 hours now) and I have no idea about the runtime.
dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){
        temp[i,10] <- i
        if (i > 1) {
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) {
                temp[i,10] <- temp[i,9] + temp[i-1,10]
            } else {
                temp[i,10] <- temp[i,9]
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}
Any ideas how to speed up this operation?
Comments (10)
The biggest problem and root of ineffectiveness is indexing the data.frame, I mean all these lines where you use temp[,]. Try to avoid this as much as possible. I took your function, changed the indexing, and here is version_A:
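(The answer's listing is not preserved in this copy; the following is a reconstruction of the described change, with results gathered in a vector res and attached to the data.frame once at the end; the function name dayloop2_A is assumed.)

dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))   # vector that gathers the results
    for (i in 1:nrow(temp)){
        res[i] <- i
        if (i > 1) {
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) {
                res[i] <- temp[i,9] + res[i-1]
            } else {
                res[i] <- temp[i,9]
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$Kumm. <- res            # attach once; no renaming needed
    return(temp)
}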
As you can see, I create a vector res which gathers the results. At the end I add it to the data.frame, and I don't need to mess with names. So how much better is it?
I ran each function for a data.frame with nrow from 1,000 to 10,000 by 1,000, and measured the time with system.time.
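(The answer reported the timings in a plot that is not reproduced here; a sketch of how such a benchmark can be set up follows. The data generation below is an assumption.)

# time both versions on data.frames of growing size
sizes <- seq(1000, 10000, by = 1000)
times <- sapply(sizes, function(n) {
    X <- as.data.frame(matrix(sample(1:10, n * 9, replace = TRUE), ncol = 9))
    c(orig = system.time(dayloop2(X))[["elapsed"]],
      A    = system.time(dayloop2_A(X))[["elapsed"]])
})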
You can see that your version depends exponentially on nrow(X). The modified version has a linear relation, and a simple lm model predicts that the computation for 850,000 rows takes 6 minutes and 10 seconds.

Power of vectorization
As Shane and Calimo state in their answers, vectorization is a key to better performance.
From your code you could move outside of the loop:

- the conditioning
- the initialization of the results (which are temp[i,9])

This leads to this code:
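(Again, the listing itself is lost in this copy; a reconstruction matching the description, with the condition computed once outside the loop and the results initialized from column 9; the function name is assumed.)

dayloop2_B <- function(temp){
    # condition computed once: does row i match row i-1 on columns 6 and 3?
    cond <- c(FALSE, (temp[-nrow(temp), 6] == temp[-1, 6]) &
                     (temp[-nrow(temp), 3] == temp[-1, 3]))
    res <- temp[, 9]             # initialization of the results
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i, 9] + res[i-1]
    }
    temp$Kumm. <- res
    return(temp)
}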
Compare the results for these functions, this time for nrow from 10,000 to 100,000 by 10,000.

Tuning the tuned
Another tweak is to change the indexing in the loop from temp[i,9] to res[i] (which are exactly the same in the i-th loop iteration). It's again the difference between indexing a vector and indexing a data.frame. A second thing: when you look at the loop, you can see that there is no need to loop over all i, but only over the ones that fit the condition. So here we go:
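(A reconstruction of that variant: indexing the vector res instead of the data.frame, and looping only over the indices where the condition holds.)

dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp), 6] == temp[-1, 6]) &
                     (temp[-nrow(temp), 3] == temp[-1, 3]))
    res <- temp[, 9]
    for (i in (1:nrow(temp))[cond]) {   # only the rows that fit the condition
        res[i] <- res[i] + res[i-1]
    }
    temp$Kumm. <- res
    return(temp)
}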
The performance you gain depends highly on the data structure, precisely on the percentage of TRUE values in the condition. For my simulated data, the computation for 850,000 rows takes below one second.
If you want, you can go further; I see at least two things which can be done:

- write C code to do the conditional cumsum
- if you know that the max sequence in your data isn't large, then you can change the loop to a vectorized while, something like the sketch below:
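(The snippet is not preserved here; one way to realize that idea, reusing cond and res from the versions above. Each pass folds one more element into every run, so the loop executes only max-run-length times.)

n <- length(res)
while (any(cond)) {
    first <- cond & !c(FALSE, cond[-n])   # first unresolved element of each run
    idx <- which(first)
    res[idx] <- res[idx] + res[idx - 1]   # fold in the value just before it
    cond[idx] <- FALSE                    # that element is now final
}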
Code used for simulations and figures is available on GitHub.
General strategies for speeding up R code
First, figure out where the slow part really is. There's no need to optimize code that isn't running slowly. For small amounts of code, simply thinking through it can work. If that fails, Rprof and similar profiling tools can be helpful.
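(As an illustration of the profiling step; the file name is arbitrary, and Rprof/summaryRprof come with base R.)

Rprof("profile.out")          # start the sampling profiler
result <- dayloop2(temp)      # run the slow code under investigation
Rprof(NULL)                   # stop profiling
summaryRprof("profile.out")   # see which functions the time is spent in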
Once you figure out the bottleneck, think about more efficient algorithms for doing what you want. Calculations should only be run once if possible, so: store results and access them rather than repeatedly recalculating, and take loop-invariant calculations out of loops.
Using more efficient functions can produce moderate or large speed gains. For instance, paste0 produces a small efficiency gain, but .colSums() and its relatives produce somewhat more pronounced gains. mean is particularly slow.

Then you can avoid some particularly common troubles: cbind will slow you down really quickly.
Try for better vectorization, which can often but not always help. In this regard, inherently vectorized commands like ifelse, diff, and the like will provide more improvement than the apply family of commands (which provide little to no speed boost over a well-written loop).

You can also try to provide more information to R functions. For instance, use vapply rather than sapply, and specify colClasses when reading in text-based data. Speed gains will be variable depending on how much guessing you eliminate.
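(A brief hedged illustration of both points; the file name and column types below are made up.)

# vapply declares the result's type and length, so R doesn't have to guess
x <- list(a = rnorm(5), b = rnorm(5))
vapply(x, mean, FUN.VALUE = numeric(1))

# colClasses removes type-guessing when reading text-based data
# df <- read.csv("big.csv", colClasses = c("integer", "numeric", "character"))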
Next, consider optimized packages: the data.table package can produce massive speed gains where its use is possible, in data manipulation and in reading large amounts of data (fread).

Next, try for speed gains through more efficient means of calling R: use the Ra and jit packages in concert for just-in-time compilation (Dirk has an example in this presentation).

And lastly, if all of the above still doesn't get you quite as fast as you need, you may need to move to a faster language for the slow code snippet. The combination of Rcpp and inline makes replacing only the slowest part of the algorithm with C++ code particularly easy. Here, for instance, is my first attempt at doing so, and it blows away even highly optimized R solutions.

If you're still left with troubles after all this, you just need more computing power. Look into parallelization (http://cran.r-project.org/web/views/HighPerformanceComputing.html) or even GPU-based solutions (gpu-tools).

Links to other guidance
If you are using for loops, you are most likely coding R as if it were C or Java or something else. R code that is properly vectorised is extremely fast.

Take for example these two simple bits of code to generate a list of 10,000 integers in sequence. The first code example is how one would code a loop using a traditional coding paradigm; it takes 28 seconds to complete. You can get an almost 100-times improvement by the simple action of pre-allocating memory. But using the base R vector operation with the colon operator :, the operation is virtually instantaneous:
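(The original listings are not preserved in this copy; a sketch of the three variants the text describes. The 28-second and ~100x figures are the answer's own measurements, not re-run here.)

# 1. traditional loop that grows the vector: the slow version (28 s in the answer)
a <- NULL
for (i in 1:10000) a <- c(a, i)

# 2. pre-allocating the memory first: almost 100 times faster
a <- numeric(10000)
for (i in 1:10000) a[i] <- i

# 3. the base R colon operator: virtually instantaneous
a <- 1:10000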
This could be made much faster by skipping the loops by using indexes or nested ifelse() statements.

As Ari mentioned at the end of his answer, the Rcpp and inline packages make it incredibly easy to make things fast. As an example, try this inline code (warning: not tested):
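(The answer's snippet is not preserved in this copy; the reconstruction below follows the question's column layout and assumes plain numeric columns, hence the disclaimer that follows.)

library(inline)
library(Rcpp)

src <- '
    Rcpp::DataFrame df(temp);
    // Rcpp columns are 0-based: df[2], df[5], df[8] are columns 3, 6, 9
    Rcpp::NumericVector v3 = df[2], v6 = df[5], v9 = df[8];
    int n = v9.size();
    Rcpp::NumericVector res(n);
    res[0] = v9[0];
    for (int i = 1; i < n; i++) {
        if (v6[i] == v6[i-1] && v3[i] == v3[i-1])
            res[i] = v9[i] + res[i-1];
        else
            res[i] = v9[i];
    }
    return res;
'
dayloopCpp <- cxxfunction(signature(temp = "data.frame"), body = src,
                          plugin = "Rcpp")
# temp$Kumm. <- dayloopCpp(temp)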
There's a similar procedure for #include-ing things, where you just pass a parameter to cxxfunction, as include = inc. What's really cool about this is that it does all of the linking and compilation for you, so prototyping is really fast.

Disclaimer: I'm not totally sure that the class of tmp should be numeric and not numeric matrix or something else. But I'm mostly sure.

Edit: if you still need more speed after this, OpenMP is a parallelization facility good for C++. I haven't tried using it from inline, but it should work. The idea would be, in the case of n cores, to have loop iteration k carried out by k % n. A suitable introduction is found in Matloff's The Art of R Programming, available here, in chapter 16, Resorting to C.
I dislike rewriting code... Also, of course, ifelse and lapply are better options, but sometimes it is difficult to make that fit.
Frequently I use data.frames as one would use lists, such as df$var[i]. Here is a made-up example, first the data.frame version and then the list version:
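(The snippets are lost in this copy; a sketch of the comparison, with the size and column name as assumptions.)

d <- data.frame(var = 1:20000)

# data.frame version:
system.time(for (i in 1:20000) d$var[i] <- d$var[i] + 1)

# list version:
ld <- as.list(d)
system.time(for (i in 1:20000) ld$var[i] <- ld$var[i] + 1)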
It is 17x faster to use a list of vectors than a data.frame.

Any comments on why data.frames are internally so slow in this regard? One would think they operate like lists...

For even faster code, do class(d) = 'list' instead of d = as.list(d), and class(d) = 'data.frame' to convert back.
The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10h now) and I have no idea about the runtime". When developing, I always put the following kind of code into loops to get a feel for how changes affect the speed, and to monitor how long it will take to complete.
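(The snippet itself is not preserved here; a sketch of such an in-loop timer, with the names as assumptions.)

n <- nrow(temp)
t0 <- Sys.time()
for (i in 1:n) {
    # ... the real work on row i ...
    elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
    cat(sprintf("\rrow %d of %d | elapsed %.1fs | remaining ~%.1fs",
                i, n, elapsed, elapsed / i * (n - i)))
}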
Works with lapply as well.
If the function within the loop is quite fast but the number of loops is large, then consider just printing every so often, as printing to the console itself has an overhead. For example:
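(Continuing the sketch above; printing every 100th iteration is an arbitrary choice.)

for (i in 1:n) {
    # ... the real work ...
    if (i %% 100 == 0) cat("done", i, "of", n, "\n")
}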
In R, you can often speed up loop processing by using the apply family of functions (in your case, it would probably be replicate). Have a look at the plyr package, which provides progress bars.

Another option is to avoid loops altogether and replace them with vectorized arithmetic. I'm not sure exactly what you are doing, but you can probably apply your function to all rows at once:
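(The original snippet is lost; as a stand-in for the idea, the question's default branch can be applied to every row in one step.)

# vectorized: set column 10 from column 9 for all rows at once
temp[, 10] <- temp[, 9]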
This will be much, much faster, and then you can filter the rows with your condition:
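(Again a stand-in sketch: build the condition once and update only the matching rows. Note that runs longer than one row would still need iteration, as in Marek's answer.)

cond <- c(FALSE, temp[-1, 6] == temp[-nrow(temp), 6] &
                 temp[-1, 3] == temp[-nrow(temp), 3])
temp[cond, 10] <- temp[cond, 9] + temp[which(cond) - 1, 10]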
Vectorized arithmetic requires more time spent thinking about the problem, but then you can sometimes save several orders of magnitude in execution time.
Take a look at the accumulate() function from {purrr}:
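(The answer's snippet is not preserved here; a hedged sketch with accumulate() over the row indices. The column names V3, V6, V9 follow the question's column positions.)

library(purrr)

n <- nrow(temp)
cond <- c(FALSE, temp$V6[-1] == temp$V6[-n] & temp$V3[-1] == temp$V3[-n])
# carry the running sum forward, restarting it whenever cond is FALSE
temp$Kumm. <- accumulate(2:n, function(prev, i) {
    if (cond[i]) prev + temp$V9[i] else temp$V9[i]
}, .init = temp$V9[1])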
Processing with data.table is a viable option; a sketch of the approach follows below. If you ignore the possible gains from condition filtering, it is very fast. Obviously, if you can do the calculation on a subset of the data, it helps.
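(The answer's snippet is lost in this copy; a hedged sketch of the approach, with column names assumed as in the question.)

library(data.table)

# consecutive rows with the same V6 and V3 form one run; rleid() labels the
# runs, and cumsum(V9) restarts within each run, reproducing the question's
# conditional accumulation
setDT(temp)
temp[, Kumm. := cumsum(V9), by = rleid(V6, V3)]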