How to load data quickly into R?
I have some R scripts in which I have to load several dataframes as quickly as possible. This is quite important, as reading the data is the slowest part of the procedure, e.g. when plotting from different dataframes. I get the data in sav (SPSS) format, but I could transform it to any suggested format. Merging the dataframes is unfortunately not an option.
What could be the fastest way to load the data? I was thinking of the following:
- transform from sav to a binary R object (RData) the first time, and later always load that, as it seems a lot quicker than read.spss,
- transform from sav to csv files and read the data from those with the given parameters discussed in this topic,
- or is it worth setting up a MySQL backend on localhost and loading the data from that? Could it be faster? If so, can I also save any custom attr values of the variables (e.g. variable.labels from SPSS-imported files), or should this be done in a separate table?
Any other thoughts are welcome. Thank you for every suggestion in advance!
I made a little experiment below based on the answers you have given, and also added (24/01/2011) a quite "hackish" but really speedy solution that loads only a few variables/columns from a special binary file. The latter seems to be the fastest method I can imagine now, which is why I put together (05/03/2011: ver. 0.3) a small package named saves to deal with this feature. The package is under "heavy" development; any recommendation is welcome!
I will soon post a vignette with accurate benchmark results, with the help of the microbenchmark package.
4 Answers
Thank you all for the tips and answers; I did a little summary and experiment based on them.
See a little test with a public database (ESS 2008 in Hungary) below. The database has 1508 cases and 508 variables, so it could be considered mid-sized data. That might be a good example to run the test on (for me), but of course special needs would require an experiment with adequate data.
Reading the data from the SPSS sav file without any modification:
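For reference, a minimal sketch of the kind of call timed here; the file name ESS_HUN_4.sav is an assumption, not taken from the original post:

    library(foreign)
    # time the plain SPSS import; to.data.frame = TRUE returns a data frame
    system.time(df <- read.spss("ESS_HUN_4.sav", to.data.frame = TRUE))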
Loading with a converted binary object:
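A sketch of the conversion and the timed load, assuming the dataframe from above is named df:

    # one-off conversion to a binary R object
    save(df, file = "ESS_HUN_4.Rdata")
    # later sessions only need to time this load
    system.time(load("ESS_HUN_4.Rdata"))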
Trying with csv:
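A sketch of the csv route, again assuming the df and file names used above:

    # one-off export to csv
    write.csv(df, "ESS_HUN_4.csv", row.names = FALSE)
    # naive csv import with all defaults
    system.time(df <- read.csv("ESS_HUN_4.csv"))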
Trying with "fine-tuned" csv loading:
Also with package sqldf, which seems to load csv files a lot faster:
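A sketch with sqldf, which reads the csv through an embedded SQLite database:

    library(sqldf)
    system.time(df <- read.csv.sql("ESS_HUN_4.csv"))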
And also loading the data from a MySQL database running on localhost:
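A sketch of the MySQL variant; the connection details and table name are assumptions:

    library(RMySQL)
    # connection time is reported separately from the read itself
    system.time(con <- dbConnect(MySQL(), user = "user", password = "pass",
                                 dbname = "test", host = "localhost"))
    system.time(df <- dbReadTable(con, "ESS_HUN_4"))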
Here, I think we should add up the two system.time results reported, as connecting to the data also counts in our case. Please comment if I misunderstood something. But let us see what happens if we query only some variables: e.g. while plotting we do not need the whole dataframe in most cases, and querying only two variables is enough to create a nice plot of them:
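A sketch of such a query, reusing the connection from the sketch above; the two variable names are made up for illustration:

    system.time(df <- dbGetQuery(con, "SELECT A1, A19 FROM ESS_HUN_4"))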
Which seems really great! Of course, this is just after loading the table with dbReadTable.
Summary: nothing beats reading the whole data from a binary file, but reading only a few columns (or other filtered data) from the same database table might also be worth weighing in some special cases.
Test environment: HP 6715b laptop (AMD X2 2Ghz, 4 Gb DDR2) with a low-end SSD.
UPDATE (24/01/2011): I added a rather hackish, but quite "creative" way of loading only a few columns of a binary object - which looks a lot faster than any method examined above.
Be aware: the code will look really bad, but still very effective :)
First, I save all columns of a data.frame into different binary objects via the following loop:
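A sketch of that loop, assuming the dataframe is still called df: each column is written to its own small .Rdata file named after the variable.

    for (v in names(df)) {
        # copy the column to an object with the variable's name, then save it
        assign(v, df[[v]])
        save(list = v, file = paste(v, ".Rdata", sep = ""))
    }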
And then I load two columns of the data:
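A sketch of the load step, reusing the two made-up variable names from the query sketch above: only the two tiny files are touched, which is why it is so fast.

    system.time({
        load("A1.Rdata")
        load("A19.Rdata")
        df <- data.frame(A1, A19)
    })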
Which looks like a "superfast" method! :) Note: it was loaded 100 times faster than the fastest (loading the whole binary object) method above.
I have made up a very tiny package (named saves); have a look at GitHub for more details if interested.
UPDATE (06/03/2011): a new version of my little package (saves) was uploaded to CRAN, in which it is possible to save and load variables even faster - provided the user needs only a subset of the available variables in a data frame or list. See the vignette in the package sources, or the one on my homepage, for details, and let me also show a nice boxplot of some benchmarks that were done:
This boxplot shows the benefit of using the saves package to load only a subset of variables, compared against load and read.table or read.csv from base, read.spss from foreign, and the sqldf and RMySQL packages.
It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your hard drive, not R. The binary form is the internal representation of the dataframe in the workspace, so no transformation is needed anymore.
Any kind of text file is a different story, as you invariably incur an overhead: each time you read in the text file, the data has to be transformed into the binary R object. I'd forget about them. They are only useful for porting datasets from one application to another.
Setting up a MySQL backend is very useful if you need different parts of the data, or different subsets in different combinations. Especially when working with huge datasets, the fact that you don't have to load the whole dataset before you can start selecting rows/columns can gain you quite some time. But this only pays off with huge datasets, as reading a binary file is quite a bit faster than searching a database.
If the data is not too big, you can save different dataframes in one RData file, giving you the opportunity to streamline things a bit more. I often have a set of dataframes in a list or in a separate environment (see also ?environment for some simple examples). This allows for lapply/eapply solutions to process multiple dataframes at once.
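A minimal sketch (not from the original answer) of that environment idea: keep the dataframes together in one environment, save them into a single RData file, and process them all at once with eapply.

    e <- new.env()
    e$df1 <- data.frame(x = rnorm(10))
    e$df2 <- data.frame(y = rnorm(20))
    # all dataframes end up in one RData file
    save(list = ls(e), envir = e, file = "all_dataframes.RData")
    # apply a function to every dataframe in the environment
    eapply(e, nrow)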
If it's at all possible, have the data transformed into a csv or some other "simple" format to make reading as fast as possible (see Joris' answer). I import csv files en masse with the apply function, something along the lines of:
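A sketch of what such a mass import could look like; the file pattern and the use of lapply (rather than another apply variant) are assumptions:

    # read every csv in the working directory into a named list of dataframes
    files <- list.files(pattern = "\\.csv$")
    dataframes <- lapply(files, read.csv)
    names(dataframes) <- files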
I am pretty happy with RMySQL. I am not sure whether I got your question the right way, but labels should not be a problem. There are several convenience functions that just use the default SQL table and row names, but of course you can use some SQL statements.
I would say (apart from large datasets that justify the hassle) one of the main reasons to use RMySQL is being more familiar with SQL syntax than with R data-juggling functions. Personally I prefer GROUP BY over aggregate. Note that using stored procedures from inside R does not work particularly well.
Bottom line... setting up a MySQL localhost is not too much effort – give it a try! I cannot tell exactly about the speed, but I have the feeling there's a chance it's faster. However, I will try and get back here.
EDIT: here's the test... and the winner is: spacedman
File size was only about 1 MB here. MacBook Pro, 4 GB RAM, 2.4 GHz Intel Core Duo, Mac OS X 10.6.4, MySQL 5.0.41.
Just never tried that, because I usually work with bigger datasets and loading is not the issue, rather processing... if there are time issues at all. +1 for the Q!