Memory-efficient alternative to rbind - in-place rbind?
I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
4 Answers
data.table is your friend! Cf. http://www.mail-archive.com/[email protected]/msg175877.html

Following up on nikola's comment, see the description in ?rbindlist (new in v1.8.2).
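A minimal sketch of how this applies to the question's objects (assuming the data.table package is installed; df and df.extension are the question's data frames):

    library(data.table)

    # rbindlist() takes a list of data.frames/data.tables and stacks them
    # in C, with far less copying than the equivalent rbind(df, df.extension)
    combined <- rbindlist(list(df, df.extension))

    # the result is a data.table, which also inherits from data.frame
    class(combined)   # "data.table" "data.frame"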
First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.

One method that saves quite a bit of memory, though not advisable, is to pretend your data frames are lists, coerce them column by column using a for-loop (apply will eat memory like hell), and then make R believe the result actually is a data frame.

I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure you test well enough, and if possible, avoid this as much as possible.

You could try the following approach:
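Here is a sketch of the approach as described (combine column by column in a for-loop, then set the attributes so R treats the list as a data frame); it is not necessarily the answerer's exact code, and the object names dtf, dtf1 and dtf2 plus the sizes are illustrative:

    # two example data frames with identical column layout (illustrative sizes)
    n <- 1e6
    dtf1 <- data.frame(x = rnorm(n), y = rnorm(n))
    dtf2 <- data.frame(x = rnorm(n), y = rnorm(n))

    # build the combined object as a plain list, one column at a time
    dtf <- list()
    for (nm in names(dtf1)) {
      dtf[[nm]] <- c(dtf1[[nm]], dtf2[[nm]])
    }

    # make R believe it is a data frame: set names, row.names and class
    attributes(dtf) <- list(
      names     = names(dtf1),
      row.names = seq_len(nrow(dtf1) + nrow(dtf2)),
      class     = "data.frame"
    )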
It erases the row names you actually had (you can reconstruct them, but check for duplicate row names!). It also doesn't carry out all the other tests included in rbind.

In my tests this saves about half of the memory, and dtfcomb and dtf come out equal. In the memory-usage plot from the test script, the red box is rbind, the yellow one is my list-based approach.
Test script:
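A sketch of what such a test could look like (not the original script); it repeats the combination from the sketch above so it is self-contained, checks that both routes give the same values, and uses gc() to report peak memory:

    n <- 1e6
    dtf1 <- data.frame(x = rnorm(n), y = rnorm(n))
    dtf2 <- data.frame(x = rnorm(n), y = rnorm(n))

    # route 1: plain rbind
    gc(reset = TRUE)
    dtfcomb <- rbind(dtf1, dtf2)
    print(gc())   # "max used" columns show the peak for the rbind route

    # route 2: the list-based combination
    gc(reset = TRUE)
    dtf <- list()
    for (nm in names(dtf1)) dtf[[nm]] <- c(dtf1[[nm]], dtf2[[nm]])
    attributes(dtf) <- list(names = names(dtf1),
                            row.names = seq_len(2 * n),
                            class = "data.frame")
    print(gc())   # peak for the list route

    # values should match; attributes such as row names are ignored here
    all.equal(dtfcomb, dtf, check.attributes = FALSE)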
For now I have worked out the following solution:
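The exact code is not shown, so the following is a sketch of this kind of in-place extension (an assumption about the approach, reusing the question's df and df.extension): write the new rows into the existing data frame by index, then drop the extension object.

    # write the extension's rows into new row positions of the existing df
    nextrow <- nrow(df) + 1
    df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension

    # keep row names unique after the indexed assignment
    row.names(df) <- 1:nrow(df)

    # drop the extension and collect garbage to release its memory
    rm(df.extension)
    gc()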
Now I don't run out of memory. I think it's because what I have to hold in memory at once is smaller than what rbind would need, and afterwards I free the memory I don't need anymore.

This solved my problem for now, but I feel that there is a more advanced way to do a memory-efficient rbind. I appreciate any comments on this solution.
This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider: using the hard drive through memory-mapped files is far cheaper than it sounds; whether or not the matrix is file-backed, you can use separated = TRUE to make the columns separate (I haven't used this much, because of my 3rd tip); and you can size the matrix generously up front and fill it as data arrives, so there is no need to do rbind at all.

Note: Although the original question addressed data frames while bigmemory works with matrices, one can easily create different matrices for different types of data and then combine the objects in RAM to create a data frame, if it's really necessary.
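A hedged sketch of the pre-allocation idea with a numeric big.matrix (the sizes, column count, and the decision to keep it in RAM rather than file-backed are assumptions):

    library(bigmemory)

    # allocate more rows than are currently needed, so later chunks can be
    # written into the existing object instead of rbind-ing a copy
    bm <- big.matrix(nrow = 2e6, ncol = 5, type = "double", init = 0)

    # fill the first block with the existing numeric data
    existing <- matrix(rnorm(1e6 * 5), ncol = 5)
    bm[1:nrow(existing), ] <- existing

    # later, write an extension block by indexed assignment - no rbind involved
    extension <- matrix(rnorm(5e5 * 5), ncol = 5)
    rows_used <- nrow(existing) + nrow(extension)
    bm[(nrow(existing) + 1):rows_used, ] <- extension

    # materialise only the rows in use if a plain data frame is really needed
    df <- as.data.frame(bm[1:rows_used, ])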