How to speed up this R code
I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:

library(plyr)
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)

According to system.time(), it takes about this long to run:

   user  system elapsed
   5.16    0.00    5.17

This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?
Answers (6)
Just using aggregate is quite a bit faster...
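The answer's code block was lost in extraction; a minimal sketch of the idea, assuming the column names from the question. Note that aggregate's formula interface drops rows containing any NA by default, so na.action = na.pass plus an anonymous function is used here to mimic the per-column na.rm = TRUE of the original call:

# base-R aggregate() via the formula interface: group by the same four
# columns and apply median to every remaining column
df.median <- aggregate(. ~ groupname + starttime + fPhase + fCycle,
                       data = data,
                       FUN = function(x) median(x, na.rm = TRUE),
                       na.action = na.pass)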
Just to summarize some of the points from the comments:

plyr is designed primarily for ease of use, not for performance (although the recent version had some nice performance improvements). Some of the base functions are faster because they have less overhead. @JDLong pointed to a nice thread that covers some of these issues, including some specialized techniques from Hadley.

The order of the data matters when you are calculating medians: if the data are in order from smallest to largest, the calculation is a bit quicker.

For new datasets, sort the data by an appropriate column when you import them. For existing datasets you can sort them as a batch job (outside the web app).
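A sketch of that batch-job idea; the column name value and the file name data_sorted.rds are assumptions, not part of the original answer:

# one-time batch job, run outside the web app: store the data pre-sorted
# so later median() calls work over ordered values
data <- data[order(data$value), ]  # 'value' is a hypothetical measurement column
saveRDS(data, "data_sorted.rds")   # hypothetical file name for the sorted copy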
To add to Joshua's solution: if you decide to use mean instead of median, you can speed up the computation another 4 times:
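The answer's code was lost in extraction. The speedup is plausible because mean() is a single arithmetic pass while median() must partially sort each group; a sketch of the substitution, reusing the aggregate() call from above with the column names from the question:

# same grouping as before, but mean() avoids the per-group sort
df.mean <- aggregate(. ~ groupname + starttime + fPhase + fCycle,
                     data = data,
                     FUN = function(x) mean(x, na.rm = TRUE),
                     na.action = na.pass)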
Working with this data is considerably faster with dplyr; a sketch of the call follows below. (You'll need dplyr 0.2 to get %>% and summarise_each.) This compares favourably to plyr, and to aggregate() (code from @joshua-ulrich).
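The original code and timing blocks were lost in extraction; a sketch of what the dplyr call presumably looked like, using the dplyr 0.2 API named above and the column names from the question:

library(dplyr)

# group by the four key columns, then take the median of every other column
df.median <- data %>%
  group_by(groupname, starttime, fPhase, fCycle) %>%
  summarise_each(funs(median(., na.rm = TRUE)))

In current dplyr, summarise_each() and funs() have been superseded by summarise(across(...)).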
Well, I just did a few simple transformations on a large data frame (the baseball data set in the plyr package) using the standard library functions (e.g., table, tapply, aggregate, etc.) and the analogous plyr functions; in each instance, I found plyr to be significantly slower. E.g.,
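The example that followed was lost in extraction; a sketch of the kind of comparison described, using the year and ab columns of the baseball data set (the choice of columns here is an assumption):

library(plyr)
data(baseball)

# base R: mean at-bats per year
system.time(tapply(baseball$ab, baseball$year, mean))

# the analogous plyr call, typically noticeably slower
system.time(ddply(baseball, .(year), summarise, mean_ab = mean(ab)))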
Second, and I did not investigate whether this would improve performance in your case, but for data frames of the size you are working with now and larger, I use the data.table library, available on CRAN. It is simple to create data.table objects, as well as to convert extant data.frames to data.tables; just call data.table() on the data.frame you want to convert:
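The conversion snippet was lost in extraction; a sketch of the conversion plus the grouped-median query it enables, assuming the column names from the question:

library(data.table)

# convert the existing data.frame into a data.table
dt <- data.table(data)

# grouped medians: .SD is the subset of non-grouping columns in each group
df.median <- dt[, lapply(.SD, median, na.rm = TRUE),
                by = list(groupname, starttime, fPhase, fCycle)]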