Splitting a large list into chunks with convenient I/O

Posted 2025-01-14 20:53:29


I have a large list with a size of approx. 1.3 GB. I'm looking for the fastest solution in R to generate chunks and save them in any convenient format so that:

a) every saved chunk file is smaller than 100 MB

b) the original list can be loaded conveniently and quickly into a new R workspace

EDIT II: The reason for doing this is to have an R solution that bypasses the GitHub file size restriction of 100 MB per file. The restriction to R is due to some external non-technical constraints which I can't comment on.

What is the best solution for this problem?

EDIT I: As mentioned in the comments, some code for the problem helps to make a better question:

An R example of a list with a size of 1.3 GB:

li <- list(a = rnorm(10^8),
           b =  rnorm(10^7.8))
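A quick check of the in-memory footprint (not in the original question; object.size is base R):

print(object.size(li), units = "auto")   # about 1.2 Gb (~1.3e9 bytes)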

情独悲 2025-01-21 20:53:29


So, you want to split a file and reload it into a single dataframe.

There is a twist: to reduce file size, it would be wise to compress, but then the file size is not entirely deterministic. You may have to tweak a parameter.
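Since the compressed size is only known after writing, a small helper can verify the result (check.sizes is a hypothetical helper, not part of the original answer; it assumes the five-digit .rds naming used by split.file below):

check.sizes <- function(basename, limit = 100 * 2^20) {
  # Files written by split.file: basename followed by five digits.
  files <- list.files(pattern = sprintf("%s[0-9]{5}\\.rds", basename))
  sizes <- file.size(files)
  if (any(sizes >= limit))
    warning("some chunks reach the limit; re-split with a smaller 'rows'")
  setNames(sizes, files)
}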

The following is a piece of code I have used for a similar task (unrelated to GitHub though).

The split.file function takes 3 arguments: a dataframe, the number of rows to write in each file, and the base filename. For instance, if basename is "myfile", the files will be "myfile00001.rds", "myfile00002.rds", etc.
The function returns the number of files written.

The join.files function takes the base name.

Note:

  • Play with the rows parameter to find out the correct size to fit in 100 MB (see the sketch after this list for one way to estimate a starting value). It depends on your data, but for similar datasets a fixed size should do. However, if you are dealing with very different datasets, this approach will likely fail.
  • When reading, you need twice as much memory as is occupied by your dataframe (because a list of the smaller dataframes is read first, then rbinded).
  • The file number is written with 5 digits, but you can change that. The goal is to have the names in lexicographic order, so that when the files are concatenated, the rows are in the same order as in the original file.
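One way to pick a starting value for rows (my own heuristic, not from the answer): write a single trial chunk with the same compression settings, measure its on-disk size, and scale to a target a little under 100 MB. The name estimate.rows and the 90 MB target are assumptions:

estimate.rows <- function(db, trial = 10000, target = 90 * 2^20) {
  trial <- min(trial, nrow(db))
  # Write a trial chunk with the same settings as split.file below.
  saveRDS(db[seq_len(trial), , drop = FALSE], file = "trial.rds",
          compress = "xz", ascii = FALSE)
  bytes.per.row <- file.size("trial.rds") / trial
  file.remove("trial.rds")
  floor(target / bytes.per.row)
}

Compression ratios can drift across a dataframe, hence the margin below 100 MB.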

Here are the functions:

split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows                          # number of full chunks
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  if (m * rows < n) {                      # write the remainder, if any
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m                                        # number of files written
}

join.files <- function(basename) {
  # Lexicographic sort keeps the chunks in their original row order.
  files <- sort(list.files(pattern = sprintf("%s[0-9]{5}\\.rds", basename)))
  do.call("rbind", lapply(files, readRDS))
}

Example:

n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")   # writes 16 files: 15 full chunks + 1 remainder
dbx <- join.files("myfile")
all(dbx$x == db$x)                 # TRUE: the rebuilt dataframe matches the original
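Finally, note that the question's object is a list, not a dataframe. A different sketch for arbitrary objects (my own, not part of the answer above): base R's serialize() turns any object into a raw vector, which can be split into fixed-size binary chunks and reassembled exactly. The names split.raw and join.raw are assumptions, and the chunks are written uncompressed:

split.raw <- function(obj, basename, chunk = 99 * 2^20) {
  bytes <- serialize(obj, NULL)        # the whole object as raw bytes
  starts <- seq(1, length(bytes), by = chunk)
  for (k in seq_along(starts)) {
    piece <- bytes[seq(starts[k], min(starts[k] + chunk - 1, length(bytes)))]
    writeBin(piece, sprintf("%s%.5d.bin", basename, k))
  }
  length(starts)                       # number of files written
}

join.raw <- function(basename) {
  files <- sort(list.files(pattern = sprintf("%s[0-9]{5}\\.bin", basename)))
  unserialize(do.call(c, lapply(files, function(f)
    readBin(f, "raw", n = file.size(f)))))
}

split.raw(li, "lichunk")
identical(li, join.raw("lichunk"))   # TRUE: serialization round-trips exactly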