What is a higher-performance alternative to a for loop for subsetting data by group ID?

Posted 2024-08-26 13:23:19

A recurring analysis paradigm I encounter in my research is the need to subset based on all different group id values, performing statistical analysis on each group in turn, and putting the results in an output matrix for further processing/summarizing.

How I typically do this in R is something like the following:

data.mat <- read.csv("...")  
groupids <- unique(data.mat$ID)  #Assume there are then 100 unique groups
  
results <- matrix(rep("NA",300),ncol=3,nrow=100)  

for(i in 1:100) {  
  tempmat <- subset(data.mat,ID==groupids[i])  

  # Run various stats on tempmat (correlations, regressions, etc), checking to  
  # make sure this specific group doesn't have NAs in the variables I'm using  
  # and assign results to x, y, and z, for example.  

  results[i,1] <- x  
  results[i,2] <- y  
  results[i,3] <- z  
}

This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.

Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
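For reference, one way the loop could be phrased with the apply family is sketched below; calc_stats is a hypothetical stand-in for whatever statistics produce x, y, and z:

# Hypothetical helper standing in for the statistics run inside the loop;
# it should return a vector of length 3 (x, y, z) for one group's rows.
calc_stats <- function(tempmat) {
  c(x = NA, y = NA, z = NA)   # placeholder for the real calculations
}

# split() partitions the data frame by ID once, instead of re-scanning the
# full data set for every group inside the loop.
by.id   <- split(data.mat, data.mat$ID)
results <- do.call(rbind, lapply(by.id, calc_stats))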

Edit:

Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain, which surprised me. The code I used and its output are below, using the same object names as above:

system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})  
   user  system elapsed  
 361.41   92.62  458.32
system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})  
   user  system elapsed   
 378.44  102.03  485.94

Update:

In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):

data.mat <- read.csv("//home....")  
groupids <- unique(data.mat$ID)
  
results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))  

for(i in 1:length(groupids)) {  
  tempmat <- data.mat[data.mat$ID==groupids[i],] 

  results[i,1] <- groupids[i]  
  results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")  
  results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")    

}  

I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in the mid-afternoon. Figure 5-7 hours.

Restructuring my code to use data.table....

library(data.table)

data.mat <- read.csv("//home....")  
data.mat <- data.table(data.mat)  
  
testfunc <- function(x,y,z) {  
  temp1 <- cor(x,y,use="pairwise.complete.obs")  
  temp2 <- cor(x,z,use="pairwise.complete.obs")  
  res <- list(temp1,temp2)  
  res  
}  

system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])  
 user  system  elapsed  
16.41    0.05    17.44  

Comparing the results using data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to have given me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
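As a minimal sketch of how the two sets of results could be checked against each other (assuming numeric IDs; the loop's results matrix holds character strings because it was initialized with "NA", and the unnamed columns returned by testfunc would presumably come back as V1 and V2, data.table's defaults):

loop.res <- data.frame(ID    = as.numeric(results[, 1]),
                       cor12 = as.numeric(results[, 2]),
                       cor13 = as.numeric(results[, 3]))
dt.res <- as.data.frame(test)          # expected columns: ID, V1, V2
merged <- merge(loop.res, dt.res, by = "ID")

all.equal(merged$cor12, merged$V1)
all.equal(merged$cor13, merged$V2)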

Update 2:

Running the code using subsets finally finished up again:

   user     system   elapsed  
17575.79  4247.41   23477.00

Update 3:

I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.

Using the same variables and setup as before...

library(plyr)

data.mat <- read.csv("//home....")  
system.time(hmm <- ddply(data.mat, "ID", function(df)
  c(cor(df$Var1, df$Var2, use = "pairwise.complete.obs"),
    cor(df$Var1, df$Var3, use = "pairwise.complete.obs"))))
  user  system elapsed  
250.25    7.35  272.09  

Comments (4)

陌生 2024-09-02 13:23:19

This is pretty much exactly what the plyr package is designed to make easier. However, it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.

梦言归人 2024-09-02 13:23:19

Besides plyr, you can also try the foreach package to get rid of the explicit loop counter, but I don't know whether it will give you any performance benefit.

Foreach, nevertheless, gives you a quite simple interface to parallel chunk processing if you have a multicore workstation (with the doMC/multicore packages; see "Getting Started with doMC and foreach", http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf, for details), in case you have ruled out parallel processing only because it is not very easy for students to understand. If that is not the only reason, plyr is a very good solution IMHO.
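For illustration, a minimal sketch of what the foreach/doMC combination might look like for the correlation problem in the question (cores = 4 is an arbitrary choice; data.mat, groupids, Var1, Var2 and Var3 are the objects from the question):

library(foreach)
library(doMC)
registerDoMC(cores = 4)   # arbitrary number of worker processes

# Each iteration handles one group id; .combine = rbind stacks the
# per-group result vectors into a single matrix.
par.results <- foreach(g = groupids, .combine = rbind) %dopar% {
  tempmat <- data.mat[data.mat$ID == g, ]
  c(g,
    cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs"),
    cor(tempmat$Var1, tempmat$Var3, use = "pairwise.complete.obs"))
}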

永不分离 2024-09-02 13:23:19

Personally, I find plyr not very easy to understand. I prefer data.table, which is also faster. For instance, suppose you want the standard deviation of column my_column for each ID.

dt <- data.table(df)   # one-time operation: convert the data frame df to a data.table
result.sd <- dt[, sd(my_column), by = "ID"]   # one row per ID, with the SD in the second column

Three statements of this kind and a cbind at the end - that is all you need.
You can also use dt to do something for only one ID, without a subset command, using the new syntax:

result.sd.oneID <- dt[ID == "oneID", sd(my_column)]

The first argument refers to rows (i), the second to columns (j).

I find it easier to read than plyr, and it is more flexible, as you can also take subsets within a "subset"...
The documentation describes it as using SQL-like methods. For instance, by is pretty much "group by" in SQL. Well, if you know SQL you can probably do much more, but knowing it is not necessary to make use of the package.
Finally, it is extremely fast, as each operation is not only parallel, but data.table also grabs only the data needed for the calculation. subset, however, keeps the levels of the whole matrix and drags it through memory.
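For reference, a small self-contained sketch of the i/j/by pattern described above, using toy data and the column names from this answer (ID, my_column):

library(data.table)

# Toy data: three IDs with five observations each.
dt <- data.table(ID = rep(c("a", "b", "c"), each = 5),
                 my_column = rnorm(15))

# j is evaluated once per group defined by 'by': SD of my_column for each ID.
result.sd <- dt[, sd(my_column), by = "ID"]

# i filters rows, so this computes the SD for a single ID only.
result.sd.oneID <- dt[ID == "a", sd(my_column)]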

百善笑为先 2024-09-02 13:23:19

You have already suggested vectorizing and avoiding making unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what I did and just assume that vectorizing will always give you a performance boost (like it does in other languages, e.g., Python + NumPy, MATLAB).

An example:

# small function to time the results:
time_this = function(...) {
  start.time = Sys.time(); eval(..., sys.frame(sys.parent(sys.parent()))); 
  end.time = Sys.time(); print(end.time - start.time)
}

# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)

# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )

print(x)    # returns => 0.5312099
print(y)    # returns => 0.661242

The 'apply' version is actually slower than the 'for' version. (According to the author of The R Inferno, if you are doing this you are not vectorizing, you are 'loop hiding'.)

But where you can get a performance boost is by using built-ins. Below, I've timed the same operation as the two above, just using the built-in function 'rowMeans':

z = time_this(rowMeans(a))
print(z)    # returns => 0.03679609

An order of magnitude improvement versus the 'for' loop (and the vectorized version).

The other members of the apply family are not just wrappers over a native 'for' loop.

a = abs(floor(10*rnorm(1e6)))

time_this(sapply(a, sqrt))
# returns => 6.64 secs

time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs

'sapply' is about 5x slower compared with a 'for' loop.
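(For completeness, a sketch of the fully vectorized call on the same data, using the time_this helper from above; the original answer did not time this form, so no figure is quoted.)

# sqrt() is itself vectorized, so the whole vector can be transformed in a
# single call, with no explicit loop and no sapply overhead.
time_this(sqrt(a))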

Finally, w/r/t vectorized versus 'for' loops, I don't think I ever use a loop if I can use a vectorized function: the latter usually means fewer keystrokes, and it's a more natural way (for me) to code, which is a different kind of performance boost, I suppose.
