Quickly reading very large tables as dataframes

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?

拒绝两难 2024-08-17 16:39:04

An update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

  1. Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.

  2. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer. (A minimal fread call for a file like the OP's is sketched after this list.)

  3. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

  4. read.csv.raw from iotools provides a third option for quickly reading CSV files.

  5. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

  6. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
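As a rough, hedged sketch (not from any of the linked answers), option 2 applied to a file like the OP's headerless tab-delimited table might look like the following; the file name and column names are just the OP's placeholders:

library(data.table)

# fread returns a data.table; names and types are supplied up front since the OP knows them
DT <- fread("myfile", sep = "\t", header = FALSE,
            col.names  = c("url", "popularity", "mintime", "maxtime"),
            colClasses = c("character", "numeric", "numeric", "numeric"))

df <- as.data.frame(DT)   # only needed if a plain data.frame is required downstream

vroom::vroom() and readr::read_tsv() fill the same role (with col_names/col_types instead of col.names/colClasses).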


The original answer

There are a couple of simple things to try, whether you use read.table or scan.

  1. Set nrows=the number of records in your data (nmax in scan).

  2. Make sure that comment.char="" to turn off interpretation of comments.

  3. Explicitly define the classes of each column using colClasses in read.table.

  4. Setting multi.line=FALSE may also improve performance in scan. (The sketch after this list combines these settings for the OP's file.)
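A hedged sketch of what those settings look like combined, for the OP's headerless tab-delimited file (3e7 is the row count stated in the question; names and types come from the OP's scan() call):

# scan(): nmax, multi.line = FALSE and an empty comment.char together
datalist <- scan('myfile', sep = '\t', nmax = 3e7, multi.line = FALSE,
                 comment.char = "",
                 what = list(url = '', popularity = 0, mintime = 0, maxtime = 0))

# read.table(): nrows, comment.char = "" and explicit colClasses
df <- read.table('myfile', sep = '\t', header = FALSE, nrows = 3e7,
                 comment.char = "",
                 colClasses = c("character", "numeric", "numeric", "numeric"),
                 col.names  = c("url", "popularity", "mintime", "maxtime"))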

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.

Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; the next time, you can retrieve it faster with readRDS.
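A small sketch of that read-once-then-reload workflow (file names are illustrative):

# first session: pay the parsing cost once, then keep a binary copy
df <- read.table('myfile', sep = '\t', header = FALSE,
                 colClasses = c("character", "numeric", "numeric", "numeric"))
saveRDS(df, 'myfile.rds')

# later sessions: reload the binary blob instead of re-parsing the text file
df <- readRDS('myfile.rds')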

清君侧 2024-08-17 16:39:04

Here is an example that utilizes fread from data.table 1.8.7

The examples come from the help page to fread, with the timings on my Windows XP Core 2 Duo E8400.

library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
                 b=sample(1:1000,n,replace=TRUE),
                 c=rnorm(n),
                 d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
                 e=rnorm(n),
                 f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]

standard read.table

write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")    
## File size (MB): 51 

system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   24.71    0.15   25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   17.85    0.07   17.98

optimized read.table

system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",  
                          stringsAsFactors=FALSE,comment.char="",nrows=n,                   
                          colClasses=c("integer","integer","numeric",                        
                                       "character","numeric","integer")))


##    user  system elapsed 
##   10.20    0.03   10.32

fread

require(data.table)
system.time(DT <- fread("test.csv"))                                  
 ##    user  system elapsed 
##    3.12    0.01    3.22

sqldf

require(sqldf)

system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))             

##    user  system elapsed 
##   12.49    0.09   12.69

# sqldf as on SO

f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

##    user  system elapsed 
##   10.21    0.47   10.73

ff / ffdf

 require(ff)

 system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))   
 ##    user  system elapsed 
 ##   10.85    0.10   10.99

In summary:

##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##    3.12    0.01    3.22  fread
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf

天邊彩虹 2024-08-17 16:39:04

I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.

There's been a little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.

Here's my test code:

Set up the test data:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)

I restarted R before running the following import routine:

library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

I let the following line run all night but it never completed:

system.time(big.df <- read.csv('bigdf.csv'))

等待我真够勒 2024-08-17 16:39:04

Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:

attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"

This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.
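Applied to the OP's scan() call, the idea looks roughly like this (a sketch; the names are already set because of the list() template passed to scan()):

datalist <- scan('myfile', sep = '\t',
                 what = list(url = '', popularity = 0, mintime = 0, maxtime = 0))

# "turn" the named list into a data.frame in place, without copying the columns
attr(datalist, "row.names") <- .set_row_names(length(datalist[[1]]))
class(datalist) <- "data.frame"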

[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
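For illustration only (the file name is made up and the column is one of the OP's), the per-column binary dump idea looks roughly like:

# one-off: dump a numeric column to a raw binary file
writeBin(df$popularity, "popularity.bin")

# later: read it straight back; n is an upper bound on the number of values (3e7 rows here)
popularity <- readBin("popularity.bin", what = "double", n = 3e7)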

挥剑断情 2024-08-17 16:39:04

This was previously asked on R-Help, so that's worth reviewing.

One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.
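A rough, unbenchmarked sketch of that readChar()/strsplit() idea for the OP's headerless tab-delimited file (substr() would come into play if the fields were fixed-width):

raw  <- readChar('myfile', file.info('myfile')$size, useBytes = TRUE)
rows <- strsplit(raw, "\n", fixed = TRUE)[[1]]
cols <- strsplit(rows, "\t", fixed = TRUE)

df <- data.frame(url        = vapply(cols, `[`, "", 1),
                 popularity = as.numeric(vapply(cols, `[`, "", 2)),
                 mintime    = as.numeric(vapply(cols, `[`, "", 3)),
                 maxtime    = as.numeric(vapply(cols, `[`, "", 4)),
                 stringsAsFactors = FALSE)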

I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):

str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)

The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.

赤濁 2024-08-17 16:39:04

An alternative is to use the vroom package. Now on CRAN.
vroom doesn't load the entire file; it indexes where each record is located, and the records are read later, when you use them.

Only pay for what you use.

See Introduction to vroom, Get started with vroom and the vroom benchmarks.

The basic overview is that the initial read of a huge file, will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.
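As a sketch of what a call might look like for the OP's file (the column names are the OP's; "cddd" is the compact column-type spec, character plus three doubles):

library(vroom)

# lazy, indexed read of a headerless tab-delimited file
df <- vroom("myfile", delim = "\t",
            col_names = c("url", "popularity", "mintime", "maxtime"),
            col_types = "cddd")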

See a simplified example from the vroom benchmarks below; the key parts to see are the super-fast read times, but slightly slower operations like aggregate, etc.

package                 read    print   sample   filter  aggregate   total
read.delim              1m      21.5s   1ms      315ms   764ms       1m 22.6s
readr                   33.1s   90ms    2ms      202ms   825ms       34.2s
data.table              15.7s   13ms    1ms      129ms   394ms       16.3s
vroom (altrep) dplyr    1.7s    89ms    1.7s     1.3s    1.9s        6.7s

独留℉清风醉 2024-08-17 16:39:04

I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.

Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.

My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.

This linked article provides benchmarking and a good overview. I have quoted some interesting points below.

https://ursalabs.org/blog/2019-10-columnar-perf/

File Size

That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.

Read Speed

When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see that the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.


Independent Test

I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. Also I added a short text field of random words and two simulated factors.

Data

library(dplyr)
library(tibble)
library(OpenRepGrid)

n <- 1000000

set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))


test_data <- mtcars %>%
  rownames_to_column() %>%
  sample_n(n, replace = TRUE) %>%
  mutate_all(~ sample(., length(.))) %>%
  mutate(factor1 = sample(some_levels1, n, replace = TRUE),
         factor2 = sample(some_levels2, n, replace = TRUE),
         text = randomSentences(n, sample(3:8, n, replace = TRUE))
         )

Read and Write

Writing the data is easy.

library(arrow)

write_parquet(test_data , "test_data.parquet")

# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)

Reading the data is also easy.

read_parquet("test_data.parquet")

# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)

I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.

Benchmarking

This file is nowhere near as large as the benchmark article, so maybe that is the difference.

Tests

  • rds: test_data.rds (20.3 MB)
  • parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
  • parquet2: test_data2.parquet (14.9 MB with higher compression)
  • parquet: test_data.parquet (40.7 MB)
  • fst2: test_data2.fst (27.9 MB with higher compression)
  • fst: test_data.fst (76.8 MB)
  • fread2: test_data.csv.gz (23.6MB)
  • fread: test_data.csv (98.7MB)
  • feather_arrow: test_data.feather (157.2 MB read with arrow)
  • feather: test_data.feather (157.2 MB read with feather)

Observations

For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.

Here fst is also a great choice. I would either use the highly compressed fst format or the highly compressed parquet depending on if I needed the speed or file size trade off.

耳钉梦 2024-08-17 16:39:04

A minor additional point worth mentioning: if you have a very large file, you can calculate the number of rows on the fly (if there is no header) using the following (where bedGraph is the name of your file in your working directory):

>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))

You can then use that in either read.csv or read.table ...

>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
   user  system elapsed 
 25.877   0.887  26.752 
>object.size(BG)
203949432 bytes

踏月而来 2024-08-17 16:39:04

Often times I think it is just good practice to keep larger data sets inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:

(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.

(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.

# Looking at Voter/Registrant Age by Decade

library(RPostgreSQL);library(lattice)

con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
                 port="2345", host="localhost", dbname="WC2014_08_01_2014")

Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)

with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)

那支青花 2024-08-17 16:39:04

I wanted to contribute Spark-based solution in the simplest form:

# Test Data ---------------------------------------------------------------

set.seed(123)
bigdf <-
    data.frame(
        dim = sample(letters, replace = T, 4e7),
        fact1 = rnorm(4e7),
        fact2 = rnorm(4e7, 20, 50)
    )
tmp_csv <- fs::file_temp(pattern = "big_df", ext = ".csv")
readr::write_csv(x = bigdf, file = tmp_csv)

# Spark -------------------------------------------------------------------

# Installing if needed
# sparklyr::spark_available_versions()
# sparklyr::spark_install()

library("sparklyr")
sc <- spark_connect(master = "local")

# Uploading CSV
system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))

Spark generated fairly OK results:

>> system.time(tbl_big_df <- spark_read_csv(sc = sc, path = tmp_csv))
   user  system elapsed 
  0.278   0.034  11.747 

This was tested on a MacBook Pro with 32 GB of RAM.
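One follow-up worth noting (not part of the timing above): spark_read_csv() returns a reference to a Spark table rather than an in-memory data.frame, so if a plain data.frame is ultimately needed it has to be collected, and the connection should be closed when done. A sketch:

# pull the Spark table into an ordinary R data.frame (may be heavy for 4e7 rows)
big_df <- dplyr::collect(tbl_big_df)

# close the local Spark connection when finished
spark_disconnect(sc)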

Remarks

Spark usually shouldn't be able to "win" against packages optimised for speed. Nevertheless, I wanted to contribute an answer using Spark:

  • For some of the comments and answers where the process didn't work, using Spark may be a viable alternative
  • In the long run, hammering as much data as possible into a data.frame may prove problematic later on, when other operations are attempted on that object and hit the performance envelope of the architecture

I think that for questions like this, where the task is to handle 1e7 or more rows, Spark should be given consideration. Even if it may be possible to "hammer" that data into a single data.frame, it just doesn't feel right. Likely that object will be difficult to work with and will create problems when deploying models, etc.

李不 2024-08-17 16:39:04

Instead of the conventional read.table, I feel fread is a faster function.
Specifying additional attributes, like selecting only the required columns and specifying colClasses and stringsAsFactors, will reduce the time taken to import the file.

data_frame <- fread("filename.csv",sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1,4,5,6,7),colClasses=c("numeric","character","numeric","Date","factor"))

池予 2024-08-17 16:39:04

I've tried all of the above and readr did the best job. I have only 8 GB of RAM.

Loop for 20 files, 5 GB each, 7 columns:

read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))