Generating a very large matrix of string combinations using combn() and the bigmemory package
I have a vector x of 1,344 unique strings. I want to generate a matrix that gives me all possible groups of three values, regardless of order, and export that to a csv.
I'm running R on EC2 on an m1.large instance with 64-bit Ubuntu. When using combn(x, 3) I get an out-of-memory error:
Error: cannot allocate vector of size 9.0 Gb
The resulting matrix would have C(1344, 3) = 403,716,544 rows and three columns, which is the transpose of what the combn() function returns.
I thought of using the bigmemory package to create a file-backed big.matrix so that I can then assign the results of the combn() function to it. I can create a preallocated big matrix:
library(bigmemory)
x <- as.character(1:1344)
combos <- 403716544
test <- filebacked.big.matrix(nrow = combos, ncol = 3,
init = 0, backingfile = "test.matrix")
But when I try to assign the values with test <- combn(x, 3), I still get the same error: Error: cannot allocate vector of size 9.0 Gb
I even tried coercing the result of combn(x, 3), but I think that because the combn() function is returning an error, the big.matrix function doesn't work either:
test <- as.big.matrix(matrix(combn(x, 3)), backingfile = "abc")
Error: cannot allocate vector of size 9.0 Gb
Error in as.big.matrix(matrix(combn(x, 3)), backingfile = "abc") :
error in evaluating the argument 'x' in selecting a method for function 'as.big.matrix'
Is there a way to combine these two functions together to get what I need? Are there any other ways of achieving this? Thanks.
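The two figures quoted in the question (the row count and the 9.0 Gb in the error message) can be checked with a few lines of R. This is only a rough sketch: it assumes 8 bytes per matrix cell, which is the pointer size for a character matrix on a 64-bit build, and that R reports sizes in units of 2^30 bytes. The strings themselves are ignored.

```r
# Number of unordered triples that can be drawn from 1,344 items
n_rows <- choose(1344, 3)

# combn(x, 3) would return a 3 x n_rows character matrix.
# Assumption: each cell costs one 8-byte pointer on 64-bit R.
bytes <- n_rows * 3 * 8

# Assumption: size expressed in units of 2^30 bytes
size_gb <- bytes / 2^30

n_rows
round(size_gb, 1)
```

The estimate lines up with the size in the error message, so the allocation fails on the result of combn() itself, before bigmemory is ever involved.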
Comments (3)
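You could first find all 2-way combinations and then combine each of them with a third value, saving the results to file as you go. This takes a lot less memory: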
combn.mod <- function(x, fname) {
  # Start from all pairs; a list keeps each pair as its own element
  tmp <- combn(x, 2, simplify = FALSE)
  n <- length(x)
  # The last two values never need to be used as the added third value
  for (i in x[-c(n, n - 1)]) {
    # Drop all combinations that contain value i
    id <- which(!unlist(lapply(tmp, function(pair) i %in% pair)))
    tmp <- tmp[id]
    # Add i to all other combinations and write to file
    out <- do.call(rbind, lapply(tmp, c, i))
    write(t(out), file = fname, ncolumns = 3, append = TRUE, sep = ",")
  }
}
combn.mod(x, "F:/Tmp/Test.txt")
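This isn't as general as Joshua's answer, though; it is written specifically for your case. I guess it is faster (again, for this particular case), but I didn't make the comparison. The function runs on my computer using a little over 50 Mb (a rough estimate) when applied to your x.
EDIT
On a side note: if this is for simulation purposes, I find it hard to believe that any scientific application needs 400+ million simulation runs. You might be asking for the correct answer to the wrong question here...
PROOF OF CONCEPT:
I replaced the write line with tt[[i]] <- out, added tt <- list() before the loop and return(tt) after it. Then: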
> do.call(rbind,combn.mod(letters[1:5]))
      [,1] [,2] [,3]
 [1,] "b"  "c"  "a"
 [2,] "b"  "d"  "a"
 [3,] "b"  "e"  "a"
 [4,] "c"  "d"  "a"
 [5,] "c"  "e"  "a"
 [6,] "d"  "e"  "a"
 [7,] "c"  "d"  "b"
 [8,] "c"  "e"  "b"
 [9,] "d"  "e"  "b"
[10,] "d"  "e"  "c"
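A closely related variant, shown here only as a sketch and not as the code from this answer: fix one element at a time, pair it with every 2-combination of the elements that come after it, and write that block before moving on. The helper name write_triples is hypothetical. Rows come out in the same order as the columns of combn(x, 3), and only one block of pairs is held in memory at a time.

```r
# Hypothetical helper: streams all 3-combinations of x to a CSV file
write_triples <- function(x, path) {
  n <- length(x)
  stopifnot(n >= 3)
  con <- file(path, open = "w")
  on.exit(close(con))
  for (i in seq_len(n - 2)) {
    # All pairs drawn from the elements after position i:
    # a 2 x choose(n - i, 2) matrix
    pairs <- combn(x[(i + 1):n], 2)
    # Prefix every pair with x[i] and write the block as CSV lines
    writeLines(paste(x[i], pairs[1, ], pairs[2, ], sep = ","), con)
  }
  invisible(choose(n, 3))
}

# Small demonstration on five letters
out_file <- tempfile(fileext = ".csv")
write_triples(letters[1:5], out_file)
readLines(out_file)
```

For the 1,344 strings in the question, the largest block is the first one, with choose(1343, 2) pairs, which is small compared with the full result.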
Here's a function I've written in R, which currently finds its (unexported) home in the LSPM package. You give it the total number of items n, the number of items to select r, and the index of the combination you want i; it returns the values in 1:n corresponding to combination i. It allows you to generate each combination based on the value of the lexicographic index:
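The LSPM function is not reproduced here. As a stand-in, the following is a minimal sketch of one common way to turn a lexicographic index into a combination ("unranking"). The name nth_combination and its details are assumptions for illustration, not the LSPM code.

```r
# Hypothetical stand-in: returns the i-th r-combination of 1:n
# in lexicographic order (i is 1-based).
nth_combination <- function(n, r, i) {
  stopifnot(r >= 1, r <= n, i >= 1, i <= choose(n, r))
  out <- integer(r)
  start <- 1L          # smallest value still available
  remaining <- i - 1   # zero-based rank still to be skipped
  for (k in seq_len(r)) {
    for (v in start:n) {
      # Number of combinations that have value v in position k
      cnt <- choose(n - v, r - k)
      if (remaining < cnt) {
        out[k] <- v
        start <- v + 1L
        break
      }
      remaining <- remaining - cnt
    }
  }
  out
}

# The indices can then be mapped back to the strings
x <- letters[1:5]
x[nth_combination(length(x), 3, 7)]
```

Because each call depends only on n, r and i, the full set never has to be in memory at once; any range of indices can be generated and written independently.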
So you just need to loop over 1:403716544 and append the results to a file. It may take a while, but it's at least feasible (see Dirk's answer). You also may need to do it in several loops, since the vector 1:403716544 will not fit in memory on my machine.
Or you could just port the R code to C/C++ and do the looping / writing there, since it would be a lot faster.
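The "several loops" idea can be sketched as a chunked writer that only ever materialises one block of indices. Everything named here (write_in_chunks, row_fun, the chunk size) is hypothetical, and the small lookup table in the demonstration merely stands in for a real index-to-combination function.

```r
# Hypothetical helper: writes rows 1..total to `path`, `chunk_size` rows at a time.
# `row_fun(i)` is assumed to return row i as a single comma-separated string.
write_in_chunks <- function(total, chunk_size, row_fun, path) {
  con <- file(path, open = "w")
  on.exit(close(con))
  first <- 1
  while (first <= total) {
    last <- min(first + chunk_size - 1, total)
    # Only this chunk's indices are created, never the full 1:total vector
    rows <- vapply(seq(first, last), row_fun, character(1))
    writeLines(rows, con)
    first <- last + 1
  }
  invisible(total)
}

# Small demonstration; the lookup table stands in for a real unranking function
x <- letters[1:5]
lookup <- combn(x, 3)
row_fun <- function(i) paste(lookup[, i], collapse = ",")
out_file <- tempfile(fileext = ".csv")
write_in_chunks(ncol(lookup), chunk_size = 4, row_fun = row_fun, path = out_file)
```

Keeping the connection open across chunks avoids reopening the file for every append; a chunk size in the tens or hundreds of thousands would be a reasonable starting point to tune for the full problem.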