Memory-efficient alternative to rbind - in-place rbind?

Posted 2024-11-29 19:17:35

I need to rbind two large data frames. Right now I use

df <- rbind(df, df.extension)

but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.

So my question is: Is there a way to avoid data duplication in memory when using rbind?

I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.

Comments (4)

束缚m 2024-12-06 19:17:35

data.table is your friend!

Cf. http://www.mail-archive.com/[email protected]/msg175877.html


Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2):

Same as do.call("rbind",l), but much faster.
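
Applied to the question's objects, usage might look like this (a sketch: df and df.extension are assumed from the question, and note that rbindlist returns a data.table rather than a plain data frame):

library(data.table)

# bind the pieces in one pass; rbindlist takes a list of data frames
# or data.tables and avoids the extra copying of the rbind(df, ...) idiom
df <- rbindlist(list(df, df.extension))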

迷爱 2024-12-06 19:17:35

First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.

One inadvisable way of saving quite a bit of memory is to pretend your data frames are lists, concatenate them into a list with a for loop (apply will eat memory like hell), and then make R believe the result actually is a data frame.

I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure you test well enough and, if possible, avoid this as much as possible.

You could try the following approach:

n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))

dtf <- list()

# concatenate the two data frames column by column into a plain list
for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}

# manually set the attributes that make R treat the list as a data frame
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"

It erases any row names you actually had (you can reconstruct them, but check for duplicate row names!). It also doesn't carry out all the other checks included in rbind.

This saves about half of the memory in my tests, and dtfcomb and dtf come out equal. The red box is rbind, the yellow one is my list-based approach.

[Figure: memory usage over time for the two approaches]
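
To check the equality claim yourself, something along these lines works (a sketch, assuming both results are kept in memory at once; row names differ between them, so attributes are excluded from the comparison):

all.equal(dtfcomb, dtf, check.attributes = FALSE)  # TRUE if the contents match
identical(dim(dtfcomb), dim(dtf))                  # same dimensions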

Test script:

n1 <- 3000000
n2 <- 3000000
ncols <- 20

dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))

# rbind approach; the Sys.sleep() pauses make each phase easy to spot
# in a memory monitor
gc()
Sys.sleep(10)
dtfcomb <- rbind(dtf1, dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)

# list-based approach
dtf <- list()
for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()

帅的被狗咬 2024-12-06 19:17:35

For now, I've worked out the following solution:

nextrow = nrow(df)+1
df[nextrow:(nextrow+nrow(df.extension)-1),] = df.extension
# we need to ensure unique row names
row.names(df) = 1:nrow(df)

Now I don't run out of memory. I think it's because I only store

object.size(df) + 2 * object.size(df.extension)

while with rbind R would need

object.size(rbind(df, df.extension)) + object.size(df) + object.size(df.extension)

After that I use

rm(df.extension)
gc(reset=TRUE)

to free the memory I don't need anymore.

This solved my problem for now, but I feel that there is a more advanced way to do a memory-efficient rbind. I'd appreciate any comments on this solution.
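
One way to quantify that peak empirically is gc()'s "max used" columns (a sketch, assuming df and df.extension as above; gc(reset = TRUE) resets the counters, so a later gc() reports the peak reached in between):

gc(reset = TRUE)   # reset the "max used" memory counters
nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
rm(df.extension)
gc()               # the "max used" columns now show the peak of the fill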

怪我入戏太深 2024-12-06 19:17:35

This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider:

  1. It's OK to use the HD: Memory mapping to the HD is much faster than practically any other access, so you may not see any slowdowns. At times I rely upon > 1TB of memory-mapped matrices, though most are between 6 and 50GB. Moreover, as the object is a matrix, this requires no real overhead of rewriting code in order to use the object.
  2. Whether you use a file-backed matrix or not, you can use separated = TRUE to make the columns separate. I haven't used this much, because of my 3rd tip:
  3. You can over-allocate the HD space to allow for a larger potential matrix size, but only load the submatrix of interest. This way there is no need to do rbind (see the sketch after the note below).

Note: Although the original question addressed data frames and bigmemory is suitable for matrices, one can easily create different matrices for different types of data and then combine the objects in RAM to create a data frame, if that's really necessary.
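
A minimal sketch of the over-allocation idea from point 3 (assumptions: purely numeric data, placeholder file names, and df/df.extension as in the question; a data frame's column names and types would have to be tracked separately, since a big.matrix holds a single type):

library(bigmemory)

# allocate a file-backed matrix large enough for both pieces up front,
# then fill the rows in place: no rbind, and the data lives on disk
n1 <- nrow(df); n2 <- nrow(df.extension); ncols <- ncol(df)
bm <- filebacked.big.matrix(n1 + n2, ncols, type = "double",
                            backingfile = "combined.bin",
                            descriptorfile = "combined.desc")
bm[1:n1, ] <- as.matrix(df)
bm[(n1 + 1):(n1 + n2), ] <- as.matrix(df.extension)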
