R: Speeding up "group by" operations

I have a simulation that has a huge aggregate-and-combine step right in the middle. I prototyped this process using plyr's ddply() function, which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel, but if this one step were faster, I could greatly decrease the number of nodes I need.

Here's a reasonable simplification of what I am trying to do:

library(plyr)    # ddply()
library(Hmisc)   # wtd.mean()

# Set up some example data
year <-    sample(1970:2008, 1e6, rep=T)
state <-   sample(1:50, 1e6, rep=T)
group1 <-  sample(1:6, 1e6, rep=T)
group2 <-  sample(1:3, 1e6, rep=T)
myFact <-  rnorm(100, 15, 1e6)
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)

# this is the step I want to make faster
system.time(
  aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                       function(df) wtd.mean(df$myFact, weights = df$weights))
)

All tips or suggestions are appreciated!

Comments (6)

星軌x 2024-09-25 18:29:20

Instead of a normal R data frame, you can use an immutable data frame, which returns pointers to the original when you subset and can be much faster:

idf <- idata.frame(myDF)
system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
   function(df) wtd.mean(df$myFact, weights=df$weights)))

#    user  system elapsed 
# 18.032   0.416  19.250 

If I were to write a plyr function customised exactly to this situation, I'd do something like this:

system.time({
  ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
  data <- as.matrix(myDF[c("myFact", "weights")])
  indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))

  fun <- function(rows) {
    weighted.mean(data[rows, 1], data[rows, 2])
  }
  values <- vapply(indices, fun, numeric(1))

  labels <- myDF[match(seq_len(attr(ids, "n")), ids), 
    c("year", "state", "group1", "group2")]
  aggregateDF <- cbind(labels, values)
})

# user  system elapsed 
# 2.04    0.29    2.33 

It's so much faster because it avoids copying the data, extracting only the subset needed for each computation at the point it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.
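
To see that last effect in isolation, here is a quick micro-benchmark (a sketch; exact timings will vary by machine) of repeated row subsetting on the two representations:

# sketch: row subsetting a matrix vs. a data frame
m  <- as.matrix(myDF[c("myFact", "weights")])
df <- myDF[c("myFact", "weights")]
rows <- sample(nrow(myDF), 100)
system.time(for (i in 1:1e4) m[rows, ])   # matrix: fast
system.time(for (i in 1:1e4) df[rows, ])  # data frame: much slower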

调妓 2024-09-25 18:29:20

A further 2x speedup, and more concise code:

library(data.table)
dtb <- data.table(myDF, key="year,state,group1,group2")
system.time( 
  res <- dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)] 
)
#   user  system elapsed 
#  0.950   0.050   1.007 

My first post, so please be nice ;)


From data.table v1.9.2, the setDT function is exported, which converts a data.frame to a data.table by reference (in keeping with data.table parlance: all set* functions modify the object by reference). This means no unnecessary copying, which is why it's fast. You can time it, but the time will be negligible.

require(data.table)
system.time({
  setDT(myDF)
  res <- myDF[, weighted.mean(myFact, weights), 
             by=list(year, state, group1, group2)] 
})
#   user  system elapsed 
#  0.970   0.024   1.015 

This is as opposed to 1.264 seconds with the OP's solution above, where data.table(.) is used to create dtb.
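
One way to convince yourself that setDT() works by reference is data.table's address() helper (a sketch, run on a fresh data.frame):

DF <- as.data.frame(myDF)           # a fresh data.frame to demonstrate on
a_before <- address(DF)             # memory address before conversion
setDT(DF)
identical(a_before, address(DF))    # TRUE: converted in place, no copy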

与之呼应 2024-09-25 18:29:20

I would do this with base R:

g <- with(myDF, paste(year, state, group1, group2))
x <- with(myDF, c(tapply(weights * myFact, g, sum) / tapply(weights, g, sum)))
aggregateDF <- myDF[match(names(x), g), c("year", "state", "group1", "group2")]
aggregateDF$V1 <- x

On my machine it takes 5 sec, compared to 67 sec for the original code.

EDIT
Just found another speedup with the rowsum function:

g <- with(myDF, paste(year, state, group1, group2))
X <- with(myDF, rowsum(data.frame(a=weights*myFact, b=weights), g))
x <- X$a/X$b
aggregateDF2 <- myDF[match(rownames(X), g), c("year", "state", "group1", "group2")]
aggregateDF2$V1 <- x

It takes 3 sec!
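
A quick sanity check (a sketch) that the tapply and rowsum versions agree, after aligning row order:

o1 <- with(aggregateDF,  order(year, state, group1, group2))
o2 <- with(aggregateDF2, order(year, state, group1, group2))
all.equal(aggregateDF$V1[o1], aggregateDF2$V1[o2])  # TRUE up to FP error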

莫相离 2024-09-25 18:29:20

Are you using the latest version of plyr (note: this hasn't made it to all the CRAN mirrors yet)? If so, you could just run this in parallel.

Here's the llply example, but the same should apply to ddply:

  x <- seq_len(20)
  wait <- function(i) Sys.sleep(0.1)
  system.time(llply(x, wait))
  #  user  system elapsed 
  # 0.007   0.005   2.005 

  library(doMC)
  registerDoMC(2) 
  system.time(llply(x, wait, .parallel = TRUE))
  #  user  system elapsed 
  # 0.020   0.011   1.038 
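
Applied to the question's aggregation, the same flag looks like this (a sketch; the speedup depends on your core count):

library(doMC)
registerDoMC(2)   # register a parallel backend first
system.time(
  aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                       function(df) wtd.mean(df$myFact, weights = df$weights),
                       .parallel = TRUE)
)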

Edit:

Well, other looping approaches are worse, so this probably requires either (a) C/C++ code or (b) a more fundamental rethinking of how you're doing it. I didn't even try using by() because that's very slow in my experience.

groups <- unique(myDF[, c("year", "state", "group1", "group2")])
system.time(
  aggregateDF <- do.call("rbind", lapply(1:nrow(groups), function(i) {
    df.tmp <- myDF[myDF$year   == groups[i, "year"]   &
                   myDF$state  == groups[i, "state"]  &
                   myDF$group1 == groups[i, "group1"] &
                   myDF$group2 == groups[i, "group2"], ]
    cbind(groups[i, ], wtd.mean(df.tmp$myFact, weights = df.tmp$weights))
  }))
)

aggregateDF <- data.frame()
system.time(
  for (i in 1:nrow(groups)) {
    df.tmp <- myDF[myDF$year   == groups[i, "year"]   &
                   myDF$state  == groups[i, "state"]  &
                   myDF$group1 == groups[i, "group1"] &
                   myDF$group2 == groups[i, "group2"], ]
    aggregateDF <- rbind(aggregateDF,
                         data.frame(cbind(groups[i, ],
                                          wtd.mean(df.tmp$myFact,
                                                   weights = df.tmp$weights))))
  }
)

烟凡古楼 2024-09-25 18:29:20

I usually use an index vector with tapply when the function being applied has multiple vector args:

system.time(
  tapply(1:nrow(myDF), myDF[c('year', 'state', 'group1', 'group2')],
         function(s) weighted.mean(myDF$myFact[s], myDF$weights[s]))
)
# user  system elapsed 
# 1.36    0.08    1.44 

I use a simple wrapper which is equivalent but hides the mess:

tmapply(list(myDF$myFact, myDF$weights), myDF[c('year', 'state', 'group1', 'group2')], weighted.mean)

Edited to include tmapply for the comment below:

tmapply = function(XS, INDEX, FUN, ..., simplify=T) {
  FUN = match.fun(FUN)
  if (!is.list(XS))
    XS = list(XS)
  # tapply over row indices: each group's index vector s is used to
  # subset every vector in XS before FUN is applied to the pieces
  tapply(1:length(XS[[1L]]), INDEX, function(s, ...)
    do.call(FUN, c(lapply(XS, `[`, s), list(...))), ..., simplify=simplify)
}
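
For this example the result is a 4-dimensional array, one dimension per grouping variable; a sketch of flattening it back into a data frame:

res <- tmapply(list(myDF$myFact, myDF$weights),
               myDF[c('year', 'state', 'group1', 'group2')], weighted.mean)
# empty group combinations come back as NA; drop them while flattening
aggregateDF <- na.omit(as.data.frame.table(res, responseName = "V1"))
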
緦唸λ蓇 2024-09-25 18:29:20

Probably the fastest solution is to use collapse::fgroup_by. It's 8x faster than data.table:

library(collapse)
myDF %>% 
  fgroup_by(year, state, group1, group2) %>% 
  fsummarise(myFact = fmean(myFact, weights))

bm <- bench::mark(
  collapse = myDF %>% 
  fgroup_by(year, state, group1, group2) %>% 
  fsummarise(myFact = fmean(myFact, weights)),
  data.table = dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)],
  check = FALSE)

#> bm
#  expression    min median itr/sec mem_alloc gc/sec n_itr n_gc total_time
#1 collapse    101ms  105ms    9.10    8.84MB   0        5    0      549ms
#2 data.table  852ms  852ms    1.17   24.22MB   2.35     1    2      852ms
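
Since check = FALSE was needed above (the two results differ in column names and row order), here is a sketch of verifying that the methods agree, reusing dtb from the data.table answer and fmean's named w argument for the weights:

res_c <- myDF %>%
  fgroup_by(year, state, group1, group2) %>%
  fsummarise(V1 = fmean(myFact, w = weights))
res_dt <- dtb[, .(V1 = weighted.mean(myFact, weights)),
              keyby = .(year, state, group1, group2)]
all.equal(as.data.frame(res_c), as.data.frame(res_dt))  # TRUE if they match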