How to load data quickly into R?



I have some R scripts in which I have to load several dataframes into R as quickly as possible. This is quite important, as reading the data is the slowest part of the procedure. E.g.: plotting from different dataframes. I get the data in sav (SPSS) format, but I could transform it to any suggested format. Merging the dataframes is unfortunately not an option.

What could be the fastest way to load the data? I was thinking of the following:

  • convert from sav to a binary R object (RData) the first time, and always load that afterwards, as it seems a lot quicker than read.spss (see the sketch after this list),
  • convert from sav to csv files and read the data from those with the tuned parameters discussed in this topic,
  • or is it worth setting up a MySQL backend on localhost and loading the data from that? Could it be faster? If so, can I also keep any custom attr values of the variables (e.g. variable.labels from SPSS-imported files)? Or should this be done in a separate table?
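For the first option, a minimal sketch of the one-time conversion might look like the following (the file names are placeholders; read.spss comes from the foreign package):

library(foreign)

# one-time conversion: read the sav file and save it as a binary R object
data <- read.spss('mydata.sav', to.data.frame = TRUE)
save(data, file = 'mydata.RData')

# every later session only needs the much faster binary load
load('mydata.RData')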

Any other thoughts are welcome. Thank you for every suggestion in advance!


I made a little experiment below based on the answers you have given, and also added (24/01/2011) a quite "hackish" but really speedy solution that loads only a few variables/columns from a special binary file. The latter seems to be the fastest method I can imagine now, which is why I made (05/03/2011: ver. 0.3) a small package named saves to deal with this feature. The package is under "heavy" development; any recommendation is welcome!

I will soon post a vignette with accurate benchmark results with the help of the microbenchmark package.
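Until then, a minimal sketch of such a benchmark (file names are placeholders, and it assumes the data was already converted to RData as above) could be:

library(microbenchmark)
library(foreign)

# compare reading the original sav file with loading the converted binary object
microbenchmark(
    spss  = read.spss('mydata.sav', to.data.frame = TRUE),
    rdata = load('mydata.RData'),
    times = 10
)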


Answers (4)

迷路的信 2024-10-20 07:48:19


Thank you all for the tips and answers; I did a small summary and experiment based on them.

See a little test with a public database (ESS 2008 in Hungary) below. The database has 1508 cases and 508 variables, so it could be considered mid-sized data. That might be a good example to run the test on (for me), but of course special needs would require an experiment with adequate data.

Reading the data from SPSS sav file without any modification:

> system.time(data <- read.spss('ESS_HUN_4.sav'))
   user  system elapsed 
  2.214   0.030   2.376 

Loading with a converted binary object:

> save('data',file='ESS_HUN_4.Rdata')
> system.time(data.Rdata <- load('ESS_HUN_4.Rdata'))
   user  system elapsed 
   0.28    0.00    0.28 

Trying with csv:

> write.table(data, file="ESS_HUN_4.csv")
> system.time(data.csv <- read.csv('ESS_HUN_4.csv'))
   user  system elapsed 
  1.730   0.010   1.824 

Trying with "fine-tuned" csv loading:

> system.time(data.csv <- read.table('ESS_HUN_4.csv', comment.char="", stringsAsFactors=FALSE, sep=","))
   user  system elapsed 
  1.296   0.014   1.362 

Also with package sqldf, which seems to load csv files a lot faster:

> library(sqldf)
> f <- file("ESS_HUN_4.csv")
>  system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F, sep="\t")))
   user  system elapsed 
  0.939   0.106   1.071 

And also loading the data from a MySQL database running on localhost:

> library(RMySQL) 
> con <- dbConnect(MySQL(), user='root', dbname='test', host='localhost', password='')
> dbWriteTable(con, "data", as.data.frame(data), overwrite = TRUE)
> system.time(data <- dbReadTable(con, 'data'))
   user  system elapsed 
  0.583   0.026   1.055 
> query <-('SELECT * FROM data')
> system.time(data.sql <- dbGetQuery(con, query))
   user  system elapsed 
  0.270   0.020   0.473 

Here, I think we should add up the two system.time results reported, as connecting to the data also counts in our case. Please comment if I misunderstood something.
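For example, the connection and the full table read could be timed in one expression (a sketch assuming the same test database as above):

system.time({
    con  <- dbConnect(MySQL(), user = 'root', dbname = 'test',
                      host = 'localhost', password = '')
    data <- dbReadTable(con, 'data')
})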

But let us see what happens when we query only some variables, since e.g. while plotting we do not need the whole dataframe in most cases, and querying only two variables is enough to create a nice plot of them:

> query <-('SELECT c1, c19 FROM data')
> system.time(data.sql <- dbGetQuery(con, query))
   user  system elapsed 
  0.030   0.000   0.112 

Which seems really great! Of course, this was run just after loading the table with dbReadTable, so some caching may help here.

Summary: nothing beats reading the whole dataset from a binary file, but reading only a few columns (or otherwise filtered data) from the same database table might also be worth considering in some special cases.

Test environment: HP 6715b laptop (AMD X2 2 GHz, 4 GB DDR2) with a low-end SSD.


UPDATE (24/01/2011): I added a rather hackish, but quite "creative" way of loading only a few columns of a binary object, which looks a lot faster than any method examined above.

Be aware: the code will look really bad, but it is still very effective :)

First, I save all columns of a data.frame into different binary objects via the following loop:

attach(data)
for (i in 1:length(data)) {
    save(list=names(data)[i],file=paste('ESS_HUN_4-', names(data)[i], '.Rdata', sep=''))
}
detach(data)

And then I load two columns of the data:

> system.time(load('ESS_HUN_4-c19.Rdata')) + 
>     system.time(load('ESS_HUN_4-c1.Rdata')) + 
>     system.time(data.c1_c19 <- cbind(c1, c19))
    user  system elapsed 
    0.003   0.000   0.002 

Which looks like a "superfast" method! :) Note: it was loaded 100 times faster than the fastest (loading the whole binary object) method above.

I have made up a very tiny package (named: saves); look on GitHub for more details if interested.


UPDATE (06/03/2011): a new version of my little package (saves) was uploaded to CRAN, in which it is possible to save and load variables even faster, provided the user needs only a subset of the available variables in a data frame or list. See the vignette in the package sources or the one on my homepage for details, and here is a nice boxplot of some of the benchmarks done:

[Boxplot: comparison of different data frame/list loading mechanisms by speed]

This boxplot shows the benefit of using the saves package to load only a subset of variables, compared against load and read.table or read.csv from base, read.spss from foreign, and the sqldf and RMySQL packages.
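For the curious, a rough sketch of how the package is intended to be used follows; the exact argument names may differ slightly in the CRAN version:

library(saves)

# save each variable of the data frame into its own binary file (one archive)
saves(data, file = 'ESS_HUN_4.RDatas')

# later: load only the two variables needed for a plot
df <- loads(file = 'ESS_HUN_4.RDatas', variables = c('c1', 'c19'), to.data.frame = TRUE)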

等风来 2024-10-20 07:48:19


It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your hard drive, not R. The binary form is the internal representation of the dataframe in the workspace, so no transformation is needed anymore.

Any kind of text file is a different story, as you invariably incur an overhead: each time you read in the text file, the data has to be transformed into a binary R object. I'd forget about them. They are only useful for porting datasets from one application to another.

Setting up a MySQL backend is very useful if you need different parts of the data, or different subsets in different combinations. Especially when working with huge datasets, the fact that you don't have to load in the whole dataset before you can start selecting the rows/columns, can gain you quite some time. But this only works with huge datasets, as reading a binary file is quite a bit faster than searching a database.

If the data is not too big, you can save different dataframes in one RData file, giving you the opportunity to streamline things a bit more. I often have a set of dataframes in a list or in a separate environment (see also ?environment for some simple examples). This allows for lapply / eapply solutions to process multiple dataframes at once.
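A minimal sketch of that pattern (object and file names are made up for illustration):

# keep several dataframes together in one environment and one RData file
e <- new.env()
e$survey2008 <- data.frame(x = rnorm(10))
e$survey2009 <- data.frame(x = rnorm(10))
save(list = ls(e), envir = e, file = 'surveys.RData')

# later: load them back into a fresh environment and process all of them at once
surveys <- new.env()
load('surveys.RData', envir = surveys)
summaries <- eapply(surveys, function(df) summary(df$x))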

阳光①夏 2024-10-20 07:48:19


If it's at all possible, have the data transformed into a csv or other "simple" format to make reading as fast as possible (see Joris' answer). I import csv files en masse with the apply function, something along the lines of:

list.of.files <- as.list(list.files("your dir", full.names = TRUE)) # "your dir" is a placeholder path
data.list <- lapply(list.of.files, FUN = function(x) {
    read.table(x, ...) # or some other reader, like read.spss
})
北恋 2024-10-20 07:48:19


I am pretty happy with RMySQL. I am not sure whether I understood your question the right way, but labels should not be a problem. There are several convenience functions that just use the default SQL table and row names, but of course you can also use your own SQL statements.

I would say (apart from large datasets that justify the hassle) one of the main reasons to use RMySQL is being more familiar with SQL syntax than with R's data-juggling functions. Personally, I prefer GROUP BY over aggregate. Note that using stored procedures from inside R does not work particularly well.
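For illustration, here is the same grouped mean computed both ways; a hedged sketch that assumes an existing connection con and a table mytable with columns grp and val:

# SQL: let the database do the grouping
res.sql <- dbGetQuery(con, "SELECT grp, AVG(val) AS mean_val FROM mytable GROUP BY grp")

# R: pull everything into the workspace and aggregate there
mydata <- dbReadTable(con, "mytable")
res.r <- aggregate(val ~ grp, data = mydata, FUN = mean)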

Bottom line... setting up a MySQL backend on localhost is not too much effort – give it a try! I cannot say anything exact about the speed, but I have the feeling there's a chance it's faster. However, I will try and get back here.

EDIT: here's the test... and the winner is: spacedman

# SQL connection
source("lib/connect.R")

dbQuery <- "SELECT * FROM mytable"
mydata <- dbGetQuery(con,dbQuery)
system.time(dbGetQuery(con,dbQuery))
# returns
#user  system elapsed 
# 0.999   0.213   1.715 

save.image(file="speedtest.Rdata")
system.time(load("speedtest.Rdata"))
#user  system elapsed 
#0.348   0.006   0.358 

File size was only about 1 MB here. MacBook Pro, 4 GB RAM, 2.4 GHz Intel Core Duo, Mac OS X 10.6.4, MySQL 5.0.41.
I just never tried that before, because I usually work with bigger datasets and loading is not the issue, rather processing... if there are time issues at all. +1 for the Q!
