Faster than scan() with Rcpp?

Posted 2024-12-29 21:54:57

Reading ~5x10^6 numeric values into R from a text file is relatively slow on my machine (a few seconds, and I read several such files), even with scan(..., what="numeric", nmax=5000) or similar tricks. Could it be worthwhile to try an Rcpp wrapper for this sort of task (e.g. Armadillo has a few utilities to read text files)?
Or would I likely be wasting my time for little to no gain in performance because of the expected interface overhead? I'm not sure what currently limits the speed (intrinsic machine performance, or something else?). It's a task I repeat many times a day, typically, and the file format is always the same: 1000 columns, around 5000 rows.

Here's a sample file to play with, if needed.

nr <- 5000
nc <- 1000

m <- matrix(round(rnorm(nr*nc),3),nr=nr)

cat(m[1, -1], "\n", file = "test.txt") # first line is shorter
write.table(m[-1, ], file = "test.txt", append=TRUE,
            row.names = FALSE, col.names = FALSE)

Update: I tried read.csv.sql and also load("test.txt", arma::raw_ascii) using Armadillo and both were slower than the scan solution.

Comments (4)

旧人 2025-01-05 21:54:58

I highly recommend checking out fread in the latest version of data.table. The version on CRAN (1.8.6) doesn't have fread yet (at the time of this post), but you should be able to get it by installing from the latest source on R-forge. See here.

两相知 2025-01-05 21:54:58

Please bear in mind that I'm not an R expert, but maybe the concept applies here too: reading binary data is usually much faster than reading text files. If your source files don't change frequently (e.g. you run varied versions of your script/program on the same data), try reading them once via scan() and storing them in a binary format (the manual has a chapter about exporting binary files).
From there on you can modify your program to read the binary input.

@Rcpp: scan() & friends are likely to call a native implementation (like fscanf()) so writing your own file read functions via Rcpp may not provide a huge performance gain. You can still try it though (and optimize for your particular data).

北城孤痞 2025-01-05 21:54:58

Salut Baptiste,

Data Input/Output is a huge topic, so big that R comes with its own manual on data input/output.

R's basic functions can be slow because they are so very generic. If you know your format, you can easily write yourself a faster import adapter. If you know your dimensions too, it is even easier as you need only one memory allocation.

Edit: As a first approximation, I would write a C++ ten-liner. Open a file, read a line, break it into tokens, assign to a vector<vector<double>> or something like that. Even if you use push_back() on individual vector elements, you should be competitive with scan(), methinks.

I once had a neat little csv reader class in C++ based on code by Brian Kernighan himself. Fairly generic (for csv files), fairly powerful.

You can then squeeze performance as you see fit.

Further edit: This SO question has a number of pointers for the csv reading case, and references to the Kernighan and Plauger book.

幸福%小乖 2025-01-05 21:54:58

Yes, you almost certainly can create something that goes faster than read.csv/scan. However, for high performance file reading there are some existing tricks that already let you go much faster, so anything you do would be competing against those.

As Mathias alluded to, if your files don't change very often, then you can cache them by calling save, then restore them with load. (Make sure to use ascii = FALSE, since reading the binary files will be quicker.)

Secondly, as Gabor mentioned, you can often get a substantial performance boost by reading your file into a database and then from that database into R.

Thirdly, you can use the HadoopStreaming package to use Hadoop's file reading capabilities.

For more thoughts in these techniques, see Quickly reading very large tables as dataframes in R.
