R: How to rbind two huge data-frames without running out of memory
I have two data-frames, df1 and df2, that each have around 10 million rows and 4 columns. I read them into R using RODBC/sqlQuery with no problems, but when I try to rbind them, I get that most dreaded of R error messages: cannot allocate memory. There has got to be a more efficient way to do this rbind -- does anyone have a favorite trick for it they want to share? For instance, I found this example in the documentation for sqldf:
# rbind
a7r <- rbind(a5r, a6r)
a7s <- sqldf("select * from a5s union all select * from a6s")
Is that the best/recommended way to do it?
UPDATE
I got it to work using the crucial dbname = tempfile() argument in the sqldf call above, as JD Long suggests in his answer to this question.
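In other words, the sqldf line from the example above becomes something like the following (a sketch; a5s and a6s are the table names used in that example):

# dbname = tempfile() makes sqldf use an on-disk SQLite database instead of
# an in-memory one, so the union does not have to fit in RAM
a7s <- sqldf("select * from a5s union all select * from a6s",
             dbname = tempfile())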
Rather than reading them into R at the beginning and then combining them, you could have SQLite read them and combine them before sending them to R. That way the files are never individually loaded into R. If row order is unimportant, a short union all query is all that is needed to do the combining; a sketch of that approach is shown below.
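A minimal sketch of the idea, assuming the two tables sit in comma-separated files with header rows (the file names, sample columns, and file.format settings here are assumptions to adapt):

library(sqldf)

# small stand-in files for the two big tables
write.table(data.frame(A = 1:2,   B = 3:4),   "data1.dat", sep = ",", quote = FALSE, row.names = FALSE)
write.table(data.frame(A = 10:11, B = 12:13), "data2.dat", sep = ",", quote = FALSE, row.names = FALSE)

# file() connections let sqldf have SQLite read the files directly,
# so neither file is ever loaded into R on its own
data1 <- file("data1.dat")
data2 <- file("data2.dat")

# dbname = tempfile() keeps the working database on disk rather than in memory;
# only the final combined result comes back into R
combined <- sqldf("select * from data1 union all select * from data2",
                  dbname = tempfile(),
                  file.format = list(header = TRUE, sep = ","))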
See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments since they are close but not identical to read.table. Here we have used the defaults so it was less of an issue.
Notice the data.table R package for efficient operations on objects with over several million records. Version 1.8.2 of that package offers the rbindlist function, through which you can achieve what you want very efficiently. Thus instead of rbind(a5r, a6r) you can use rbindlist, as sketched below.
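A minimal sketch of that replacement, with a5r and a6r being the two data frames from the sqldf example above:

library(data.table)

# rbindlist takes a list of data.frames/data.tables and binds their rows
# much more efficiently than repeated rbind() calls
a7r <- rbindlist(list(a5r, a6r))

# the result is a data.table; convert it if a plain data.frame is needed
a7r <- as.data.frame(a7r)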
Try to create a data.frame of the desired size up front, then import your data into it using subscripts. I guess that rbind grows the object without pre-allocating its dimensions... I'm not positively sure, this is only a guess. I'll comb through "The R Inferno" or "Data Manipulation with R" tonight. Maybe merge will do the trick...

EDIT

And you should bear in mind that (maybe) your system and/or R cannot cope with something that big. Try RevolutionR, maybe you'll manage to spare some time/resources.
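A minimal sketch of the pre-allocation idea, using small all-numeric stand-ins for the two big data frames (the column names and types are assumptions):

# small stand-ins for df1 and df2
df1 <- data.frame(a = 1:3,   b = 4:6,   c = 7:9,   d = 10:12)
df2 <- data.frame(a = 13:14, b = 15:16, c = 17:18, d = 19:20)

n1 <- nrow(df1)
n2 <- nrow(df2)

# allocate the result at its final size once, instead of letting rbind grow it
combined <- data.frame(a = numeric(n1 + n2),
                       b = numeric(n1 + n2),
                       c = numeric(n1 + n2),
                       d = numeric(n1 + n2))

# fill the pre-allocated rows in place by subscript
combined[1:n1, ] <- df1
combined[(n1 + 1):(n1 + n2), ] <- df2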
For completeness in this thread on the topic of unioning large files, try using shell commands on the files to combine them. On Windows that is the COPY command with the /B flag; a sketch follows.
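A sketch of the idea, run from R with shell() (Windows only; the file names are assumptions, and the combined file still has to be read back into R afterwards):

# /b concatenates the files byte for byte, with no parsing and no copy into R
shell("copy /b data1.dat+data2.dat combined.dat")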
This requires that the files have no header and use the same delimiter, etc. The speed and versatility of shell commands is sometimes a great benefit, so don't forget CLI commands when mapping out data flows.