Reading ~5x10^6 numeric values into R from a text file is relatively slow on my machine (a few seconds, and I read several such files), even with scan(..., what="numeric", nmax=5000) or similar tricks. Could it be worthwhile to try an Rcpp wrapper for this sort of task (e.g. Armadillo has a few utilities to read text files)? Or would I likely be wasting my time for little to no gain in performance, because of the expected interface overhead? I'm not sure what's currently limiting the speed (intrinsic machine performance, or something else?). It's a task that I typically repeat many times a day, and the file format is always the same: 1000 columns, around 5000 rows.
Here's a sample file to play with, if needed.
nr <- 5000
nc <- 1000
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
cat(m[1, -1], "\n", file = "test.txt")  # first line is shorter
write.table(m[-1, ], file = "test.txt", append = TRUE,
            row.names = FALSE, col.names = FALSE)
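For reference, the baseline I am timing is essentially a single scan() call on a file like this (a rough sketch; the exact arguments vary between attempts):

system.time(
  v <- scan("test.txt", what = numeric(), quiet = TRUE)  # ~5e6 doubles; what = numeric() tells scan to expect doubles
)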
Update: I tried read.csv.sql and also load("test.txt", arma::raw_ascii) using Armadillo, and both were slower than the scan solution.
I highly recommend checking out fread in the latest version of data.table. The version on CRAN (1.8.6) doesn't have fread yet (at the time of this post), so you should be able to get it if you install from the latest source at R-forge. See here.
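A minimal sketch of what that looks like once fread is available (header = FALSE is my assumption about the sample file above; the shorter first line in that file may need extra handling):

library(data.table)
system.time(
  dt <- fread("test.txt", header = FALSE)  # sep is auto-detected
)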
Please bear in mind that I'm not an R expert, but maybe the concept applies here too: reading binary data is usually much faster than reading text files. If your source files don't change frequently (e.g. you are running varied versions of your script/program on the same data), try reading them via scan() once and storing them in a binary format (the manual has a chapter about exporting binary files).
From there on you can modify your program to read the binary input.
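A minimal sketch of that caching idea, here using writeBin()/readBin() as the binary format (the file names and the choice of format are just for illustration):

# one-time conversion: parse the text file once, cache the values as raw 8-byte doubles
v <- scan("test.txt", what = numeric(), quiet = TRUE)
con <- file("test.bin", "wb")
writeBin(v, con)
close(con)

# later runs: read the binary cache instead of re-parsing the text
con <- file("test.bin", "rb")
v2 <- readBin(con, what = "double", n = file.info("test.bin")$size / 8)
close(con)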
@Rcpp: scan() & friends are likely to call a native implementation (like fscanf()), so writing your own file-reading functions via Rcpp may not provide a huge performance gain. You can still try it, though (and optimize for your particular data).
Salut Baptiste,
Data Input/Output is a huge topic, so big that R comes with its own manual on data input/output.
R's basic functions can be slow because they are so very generic. If you know your format, you can easily write yourself a faster import adapter. If you know your dimensions too, it is even easier as you need only one memory allocation.
Edit: As a first approximation, I would write a C++ ten-liner. Open a file, read a line, break it into tokens, assign to a vector<vector<double>> or something like that. Even if you use push_back() on individual vector elements, you should be competitive with scan(), methinks.
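A rough sketch along those lines (reading into a flat std::vector<double> rather than a vector of rows, and wrapped with Rcpp::cppFunction so it can be called from R; the function name is made up for illustration):

library(Rcpp)

cppFunction(includes = "#include <fstream>", code = '
NumericVector read_doubles(std::string path) {
    std::ifstream in(path.c_str());
    std::vector<double> vals;
    double x;
    while (in >> x) vals.push_back(x);   // push_back() on a growing vector
    return wrap(vals);                   // copy into an R numeric vector
}')

v <- read_doubles("test.txt")            # reshape with matrix() if the dimensions are known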
I once had a neat little csv reader class in C++, based on code by Brian Kernighan himself. Fairly generic (for csv files), fairly powerful. You can then squeeze performance as you see fit.
Further edit: This SO question has a number of pointers for the csv reading case, and references to the Kernighan and Plauger book.
Yes, you almost certainly can create something that goes faster than read.csv/scan. However, for high-performance file reading there are some existing tricks that already let you go much faster, so anything you do would be competing against those.

As Mathias alluded to, if your files don't change very often, then you can cache them by calling save, then restore them with load. (Make sure to use ascii = FALSE, since reading the binary files will be quicker.)
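A quick sketch of that caching pattern (the .RData file name is just an example):

# first run: parse the text once, cache the result in R's binary format
m <- scan("test.txt", what = numeric(), quiet = TRUE)
save(m, file = "test.RData", ascii = FALSE)

# subsequent runs: restore the cached object instead of re-parsing the text
load("test.RData")   # recreates m in the workspace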
Secondly, as Gabor mentioned, you can often get a substantial performance boost by reading your file into a database and then from that database into R.
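One common way to take the database route is sqldf's read.csv.sql, which the question's update already found slower for this particular file; roughly, and with the header/sep values as assumptions about the sample file:

library(sqldf)
df <- read.csv.sql("test.txt", header = FALSE, sep = " ")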
Thirdly, you can use the HadoopStreaming package to use Hadoop's file reading capabilities.
For more thoughts on these techniques, see Quickly reading very large tables as dataframes in R.