How to speed up this R code

Posted 2024-09-28 08:16:13

I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:

library(plyr)
df.median<-ddply(data, .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), na.rm=TRUE)

according to system.time(), it takes about this long to run:

   user  system elapsed 
   5.16    0.00    5.17

This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?

6 Answers

鹿童谣 2024-10-05 08:16:13

Just using aggregate is quite a bit faster...

> groupVars <- c("groupname","starttime","fPhase","fCycle")
> dataVars <- colnames(data)[ !(colnames(data) %in% c("location",groupVars)) ]
> 
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   1.89    0.00    1.89 
> system.time(df.median <- ddply(data, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE))
   user  system elapsed 
   5.06    0.00    5.06 
> 
> ag.median <- ag.median[ do.call(order, ag.median[,groupVars]), colnames(df.median)]
> rownames(ag.median) <- 1:NROW(ag.median)
> 
> identical(ag.median, df.median)
[1] TRUE
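
Note that the aggregate() timing above calls median without na.rm = TRUE, whereas the ddply() call uses it. If the data contained NAs, the argument can be forwarded through aggregate()'s ... argument; a minimal sketch:

ag.median <- aggregate(data[, dataVars], data[, groupVars], median, na.rm = TRUE)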
海未深 2024-10-05 08:16:13

Just to summarize some of the points from the comments:

  1. Before you start to optimize, you should have some sense of what "acceptable" performance is. Depending on the required performance, you can then be more specific about how to improve the code. For instance, at some threshold you would need to stop using R and move to a compiled language.
  2. Once you have an expected run time, you can profile your existing code to find potential bottlenecks. R has several mechanisms for this, including Rprof (there are examples on Stack Overflow if you search for [r] + rprof); see the sketch after this list.
  3. plyr is designed primarily for ease of use, not for performance (although recent versions have had some nice performance improvements). Some of the base functions are faster because they have less overhead. @JDLong pointed to a nice thread that covers some of these issues, including some specialized techniques from Hadley.
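
For point 2, a minimal Rprof sketch around the question's ddply() call (assuming data is already loaded):

library(plyr)

Rprof("ddply_profile.out")                 # start sampling the call stack
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)
Rprof(NULL)                                # stop profiling

summaryRprof("ddply_profile.out")$by.self  # where the time is actually spent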
彩虹直至黑白 2024-10-05 08:16:13

The order of the data matters when you are calculating medians: if the data are in order from smallest to largest, then the calculation is a bit quicker.

x <- 1:1e6
y <- sample(x)
system.time(for(i in 1:1e2) median(x))
   user  system elapsed 
   3.47    0.33    3.80

system.time(for(i in 1:1e2) median(y))
   user  system elapsed 
   5.03    0.26    5.29

For new datasets, sort the data by an appropriate column when you import them. For existing datasets you can sort them as a batch job (outside the web app).
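
A minimal sketch of such a batch job, assuming the data arrive as a CSV file; the file names and the choice of inadist (one of the question's value columns) as the sort key are placeholders:

data <- read.csv("data.csv")             # hypothetical input file
data <- data[order(data$inadist), ]      # pre-sort once, outside the web app
saveRDS(data, "data_sorted.rds")         # the web app then loads the sorted copy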

ぽ尐不点ル 2024-10-05 08:16:13

To add to Joshua's solution: if you decide to use the mean instead of the median, you can speed up the computation by another factor of about 4:

> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   3.472   0.020   3.615 
> system.time(ag.mean <- aggregate(data[,dataVars], data[,groupVars], mean))
   user  system elapsed 
   0.936   0.008   1.006 
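
Before switching, a quick sanity check (a sketch; inadist is one of the question's value columns, and both aggregate() calls return the groups in the same order) is to see how far the group means drift from the medians:

summary(ag.mean$inadist - ag.median$inadist)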
一袭白衣梦中忆 2024-10-05 08:16:13

Working with this data is considerably faster with dplyr:

library(dplyr)

system.time({
  data %>% 
    group_by(groupname, starttime, fPhase, fCycle) %>%
    summarise_each(funs(median(., na.rm = TRUE)), inadist:larct)
})
#>    user  system elapsed 
#>   0.391   0.004   0.395

(You'll need dplyr 0.2 to get %>% and summarise_each)
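
summarise_each() and funs() have since been superseded in later dplyr releases; a rough modern equivalent, assuming dplyr 1.0 or newer, uses across():

library(dplyr)

data %>%
  group_by(groupname, starttime, fPhase, fCycle) %>%
  summarise(across(inadist:larct, ~ median(.x, na.rm = TRUE)),
            .groups = "drop")              # return an ungrouped data frame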

This compares favourably to plyr:

library(plyr)
system.time({
  df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle), 
    numcolwise(median), na.rm = TRUE)
})
#>    user  system elapsed 
#>   0.991   0.004   0.996

And to aggregate() (code from @joshua-ulrich):

groupVars <- c("groupname", "starttime", "fPhase", "fCycle")
dataVars <- colnames(data)[ !(colnames(data) %in% c("location", groupVars))]
system.time({
  ag.median <- aggregate(data[,dataVars], data[,groupVars], median)
})
#>    user  system elapsed 
#>   0.532   0.005   0.537
年少掌心 2024-10-05 08:16:13

Well, I just did a few simple transformations on a large data frame (the baseball data set in the plyr package) using the standard library functions (e.g., 'table', 'tapply', 'aggregate', etc.) and the analogous plyr functions, and in each instance I found plyr to be significantly slower. E.g.,

> system.time(table(BB$year))
    user  system elapsed 
   0.007   0.002   0.009 

> system.time(ddply(BB, .(year), 'nrow'))
    user  system elapsed 
   0.183   0.005   0.189 

Second, I did not investigate whether this would improve performance in your case, but for data frames of the size you are working with now and larger, I use the data.table library, available on CRAN. It is simple to create data.table objects, as well as to convert existing data.frames to data.tables: just call data.table() on the data.frame you want to convert:

dt1 = data.table(my_dataframe)
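
For the aggregation in the question, a data.table sketch (assuming the same groupVars and dataVars as in the aggregate() answer above) could look like this:

library(data.table)

dt <- data.table(data)
dt.median <- dt[, lapply(.SD, median, na.rm = TRUE),   # median of each value column
                by = .(groupname, starttime, fPhase, fCycle),
                .SDcols = dataVars]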