Reading ~5x10^6 numeric values into R from a text file is relatively slow on my machine (a few seconds, and I read several such files), even with scan(..., what="numeric", nmax=5000) or similar tricks. Could it be worthwhile to try an Rcpp wrapper for this sort of task (e.g. Armadillo has a few utilities to read text files)? Or would I likely be wasting my time for little to no gain in performance, because of the expected interface overhead? I'm not sure what's currently limiting the speed (intrinsic machine performance, or something else?). It's a task that I typically repeat many times a day, and the file format is always the same: 1000 columns, around 5000 rows.
Here's a sample file to play with, if needed.
nr <- 5000
nc <- 1000
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
cat(m[1, -1], "\n", file = "test.txt")  # first line is shorter
write.table(m[-1, ], file = "test.txt", append = TRUE,
            row.names = FALSE, col.names = FALSE)
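For reference, the baseline I am timing is essentially a single scan() call on a file like this (a rough sketch; the exact arguments vary between attempts):

system.time(
  v <- scan("test.txt", what = numeric(), quiet = TRUE)  # ~5e6 doubles; what = numeric() tells scan to expect doubles
)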
Update: I tried read.csv.sql and also load("test.txt", arma::raw_ascii) using Armadillo, and both were slower than the scan solution.
I highly recommend checking out fread in the latest version of data.table. The version on CRAN (1.8.6) doesn't have fread yet (at the time of this post), so you should be able to get it if you install from the latest source at R-forge. See here.
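A minimal sketch of what that looks like once fread is available (header = FALSE is my assumption about the sample file above; the shorter first line in that file may need extra handling):

library(data.table)
system.time(
  dt <- fread("test.txt", header = FALSE)  # sep is auto-detected
)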
Please bear in mind that I'm not an R expert, but maybe the concept applies here too: reading binary data is usually much faster than reading text files. If your source files don't change frequently (e.g. you are running varied versions of your script/program on the same data), try reading them via scan() once and storing them in a binary format (the manual has a chapter about exporting binary files).
From there on you can modify your program to read the binary input.
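A minimal sketch of that caching idea, here using writeBin()/readBin() as the binary format (the file names and the choice of format are just for illustration):

# one-time conversion: parse the text file once, cache the values as raw 8-byte doubles
v <- scan("test.txt", what = numeric(), quiet = TRUE)
con <- file("test.bin", "wb")
writeBin(v, con)
close(con)

# later runs: read the binary cache instead of re-parsing the text
con <- file("test.bin", "rb")
v2 <- readBin(con, what = "double", n = file.info("test.bin")$size / 8)
close(con)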
@Rcpp: scan() & friends are likely to call a native implementation (like fscanf()), so writing your own file-reading functions via Rcpp may not provide a huge performance gain. You can still try it, though (and optimize for your particular data).
Salut Baptiste,
Data Input/Output is a huge topic, so big that R comes with its own manual on data input/output.
R's basic functions can be slow because they are so very generic. If you know your format, you can easily write yourself a faster import adapter. If you know your dimensions too, it is even easier as you need only one memory allocation.
Edit: As a first approximation, I would write a C++ ten-liner. Open a file, read a line, break it into tokens, assign to a vector<vector<double>> or something like that. Even if you use push_back() on individual vector elements, you should be competitive with scan(), methinks.
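A rough sketch along those lines (reading into a flat std::vector<double> rather than a vector of rows, and wrapped with Rcpp::cppFunction so it can be called from R; the function name is made up for illustration):

library(Rcpp)

cppFunction(includes = "#include <fstream>", code = '
NumericVector read_doubles(std::string path) {
    std::ifstream in(path.c_str());
    std::vector<double> vals;
    double x;
    while (in >> x) vals.push_back(x);   // push_back() on a growing vector
    return wrap(vals);                   // copy into an R numeric vector
}')

v <- read_doubles("test.txt")            # reshape with matrix() if the dimensions are known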
I once had a neat little csv reader class in C++, based on code by Brian Kernighan himself. Fairly generic (for csv files), fairly powerful. You can then squeeze performance as you see fit.
Further edit: This SO question has a number of pointers for the csv reading case, and references to the Kernighan and Plauger book.
Yes, you almost certainly can create something that goes faster than read.csv/scan. However, for high-performance file reading there are some existing tricks that already let you go much faster, so anything you do would be competing against those.

As Mathias alluded to, if your files don't change very often, then you can cache them by calling save, then restore them with load. (Make sure to use ascii = FALSE, since reading the binary files will be quicker.)
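A quick sketch of that caching pattern (the .RData file name is just an example):

# first run: parse the text once, cache the result in R's binary format
m <- scan("test.txt", what = numeric(), quiet = TRUE)
save(m, file = "test.RData", ascii = FALSE)

# subsequent runs: restore the cached object instead of re-parsing the text
load("test.RData")   # recreates m in the workspace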
Secondly, as Gabor mentioned, you can often get a substantial performance boost by reading your file into a database and then from that database into R.
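One common way to take the database route is sqldf's read.csv.sql, which the question's update already found slower for this particular file; roughly, and with the header/sep values as assumptions about the sample file:

library(sqldf)
df <- read.csv.sql("test.txt", header = FALSE, sep = " ")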
Thirdly, you can use the HadoopStreaming package to use Hadoop's file reading capabilities.
For more thoughts on these techniques, see Quickly reading very large tables as dataframes in R.