R: Speeding up "group by" operations
I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr's ddply() function which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel but if this one step were faster I could greatly decrease the number of nodes I need.
Here's a reasonable simplification of what I am trying to do:
library(Hmisc)  # wtd.mean()
library(plyr)   # ddply()

# Set up some example data
year    <- sample(1970:2008, 1e6, replace = TRUE)
state   <- sample(1:50, 1e6, replace = TRUE)
group1  <- sample(1:6, 1e6, replace = TRUE)
group2  <- sample(1:3, 1e6, replace = TRUE)
myFact  <- rnorm(1e6, 15, 1e6)  # was rnorm(100, ...), which silently recycled to 1e6 rows
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)

# this is the step I want to make faster
system.time(
  aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                       function(df) wtd.mean(df$myFact, weights = df$weights))
)
All tips or suggestions are appreciated!
Instead of the normal R data frame, you can use an immutable data frame, which returns pointers to the original when you subset and can be much faster:
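The answer's code block did not survive the page scrape; a minimal sketch of the idea, assuming plyr's idata.frame() wrapper and a small stand-in for the question's myDF:

```r
library(plyr)   # provides idata.frame() and ddply()
library(Hmisc)  # provides wtd.mean()

# small stand-in for the question's myDF
set.seed(1)
myDF <- data.frame(year    = sample(1970:1972, 1000, replace = TRUE),
                   state   = sample(1:3, 1000, replace = TRUE),
                   group1  = sample(1:2, 1000, replace = TRUE),
                   group2  = sample(1:2, 1000, replace = TRUE),
                   myFact  = rnorm(1000, 15),
                   weights = runif(1000))

# idata.frame() wraps the data once; subsetting the wrapper returns
# views into the original columns rather than copies
idf <- idata.frame(myDF)
aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
                     function(df) wtd.mean(df$myFact, weights = df$weights))
```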
If I were to write a plyr function customised exactly to this situation, I'd do something like this:
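The custom function itself was also lost in conversion; the idea it describes - split the row indices by group once, then pull only the rows each computation needs from a matrix - can be sketched in base R like this (my reconstruction, not the author's exact code):

```r
set.seed(1)
# small stand-in for the question's myDF
myDF <- data.frame(year    = sample(1970:1972, 1000, replace = TRUE),
                   state   = sample(1:3, 1000, replace = TRUE),
                   group1  = sample(1:2, 1000, replace = TRUE),
                   group2  = sample(1:2, 1000, replace = TRUE),
                   myFact  = rnorm(1000, 15),
                   weights = runif(1000))

# compute the group ids once, split the row indices by group,
# and subset a matrix (matrix subsetting is much cheaper than
# data frame subsetting)
grp <- interaction(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
idx <- split(seq_len(nrow(myDF)), grp)
m   <- as.matrix(myDF[c("myFact", "weights")])
res <- vapply(idx, function(i) weighted.mean(m[i, 1], m[i, 2]), numeric(1))
```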
It's so much faster because it avoids copying the data, only extracting the subset needed for each computation when it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.
Further 2x speedup and more concise code:

My first post, so please be nice ;)

From data.table v1.9.2, a setDT function is exported that converts a data.frame to a data.table by reference (in keeping with data.table parlance - all set* functions modify the object by reference). This means no unnecessary copying, which is why it is fast. You can time it, but the time will be negligible. This is as opposed to the 1.264 seconds of the OP's solution above, where data.table(.) is used to create dtb.
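The benchmark code was stripped from the page; a hedged sketch of the setDT() approach, on a small stand-in for the question's myDF:

```r
library(data.table)

set.seed(1)
# small stand-in for the question's myDF
myDF <- data.frame(year    = sample(1970:1972, 1000, replace = TRUE),
                   state   = sample(1:3, 1000, replace = TRUE),
                   group1  = sample(1:2, 1000, replace = TRUE),
                   group2  = sample(1:2, 1000, replace = TRUE),
                   myFact  = rnorm(1000, 15),
                   weights = runif(1000))

setDT(myDF)  # converts to data.table in place, by reference: no copy
# grouped weighted mean in one expression
aggregateDT <- myDF[, .(V1 = weighted.mean(myFact, weights)),
                    by = .(year, state, group1, group2)]
```

By contrast, data.table(myDF) would build the data.table by copying the input, which is where the 1.264 seconds went.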
I would profile with base R:

On my machine it takes 5 sec, compared to 67 sec for the original code.

EDIT

Just found another speed-up, with the rowsum function: it takes 3 sec!
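Both base-R versions were lost in conversion. The trick in each case is that a weighted mean is just a ratio of two grouped sums, sum(w*x)/sum(w); a sketch of both (my reconstruction), on a small stand-in for the question's data:

```r
set.seed(1)
# small stand-in for the question's myDF
myDF <- data.frame(year    = sample(1970:1972, 1000, replace = TRUE),
                   state   = sample(1:3, 1000, replace = TRUE),
                   group1  = sample(1:2, 1000, replace = TRUE),
                   group2  = sample(1:2, 1000, replace = TRUE),
                   myFact  = rnorm(1000, 15),
                   weights = runif(1000))

# one string key per row encoding all four grouping columns
g <- with(myDF, paste(year, state, group1, group2))

# tapply version: two grouped sums, then the ratio
wm1 <- with(myDF, tapply(weights * myFact, g, sum) / tapply(weights, g, sum))

# rowsum version: sums both columns by group in a single pass
s   <- rowsum(with(myDF, cbind(wx = weights * myFact, w = weights)), g)
wm2 <- s[, "wx"] / s[, "w"]
```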
Are you using the latest version of plyr (note: this hasn't made it to all the CRAN mirrors yet)? If so, you could just run this in parallel.
Here's the llply example, but the same should apply to ddply:
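The example itself was stripped from the page; a sketch along the lines of plyr's .parallel support, using the doParallel backend (my choice of backend, not necessarily the answer's):

```r
library(plyr)
library(doParallel)

registerDoParallel(cores = 2)  # register a foreach backend for plyr to use

slow <- function(i) { Sys.sleep(0.02); i^2 }

# same call, with and without the .parallel flag
r1 <- llply(1:20, slow)
r2 <- llply(1:20, slow, .parallel = TRUE)
```

The same .parallel = TRUE argument works for ddply(), so the aggregation step in the question can be parallelised without restructuring it.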
Edit:

Well, other looping approaches are worse, so this probably requires either (a) C/C++ code or (b) a more fundamental rethinking of how you're doing it. I didn't even try using by() because that's very slow in my experience.
I usually use an index vector with tapply when the function being applied has multiple vector args:
I use a simple wrapper which is equivalent but hides the mess:
Edited to include tmapply for comment below:
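The examples (including the tmapply wrapper) were lost in conversion; the index-vector trick itself can be sketched like this (my reconstruction, on a small stand-in for the question's data):

```r
set.seed(1)
# small stand-in for the question's myDF
myDF <- data.frame(year    = sample(1970:1972, 1000, replace = TRUE),
                   state   = sample(1:3, 1000, replace = TRUE),
                   group1  = sample(1:2, 1000, replace = TRUE),
                   group2  = sample(1:2, 1000, replace = TRUE),
                   myFact  = rnorm(1000, 15),
                   weights = runif(1000))

# tapply over row indices: tapply itself only takes one vector, but the
# function can then pull as many columns as it needs via the indices
grp <- with(myDF, paste(year, state, group1, group2))
res <- tapply(seq_len(nrow(myDF)), grp,
              function(i) weighted.mean(myDF$myFact[i], myDF$weights[i]))
```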
Probably the fastest solution is to use collapse::fgroup_by. It's 8x faster than data.table:
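The answer's code didn't survive the scrape; a hedged sketch, assuming collapse's fgroup_by()/fsummarise()/fmean() API (fmean takes the weights via its w argument), on a small stand-in for the question's myDF:

```r
library(collapse)

set.seed(1)
# small stand-in for the question's myDF
myDF <- data.frame(year    = sample(1970:1972, 1000, replace = TRUE),
                   state   = sample(1:3, 1000, replace = TRUE),
                   group1  = sample(1:2, 1000, replace = TRUE),
                   group2  = sample(1:2, 1000, replace = TRUE),
                   myFact  = rnorm(1000, 15),
                   weights = runif(1000))

# grouped weighted mean with collapse's vectorised group-apply functions
aggregateDF <- myDF |>
  fgroup_by(year, state, group1, group2) |>
  fsummarise(V1 = fmean(myFact, w = weights))
```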