Generating a very large matrix of string combinations using combn() and the bigmemory package

Posted on 2024-10-08 04:53:12

I have a vector x of 1,344 unique strings. I want to generate a matrix that gives me all possible groups of three values, regardless of order, and export it to a CSV file.

I'm running R on an EC2 m1.large instance with 64-bit Ubuntu. When using combn(x, 3) I get an out-of-memory error:

Error: cannot allocate vector of size 9.0 Gb

The resulting matrix has C(1344, 3) = 403,716,544 rows and three columns, which is the transpose of what the combn() function returns.
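
As a rough check on these numbers (a minimal sketch, assuming 64-bit R, where each element of a character vector is stored as an 8-byte pointer), the row count and the size quoted in the error message can be reproduced as follows. The variable names are only illustrative.

```r
n_combos <- choose(1344, 3)        # number of unordered triples
n_cells  <- n_combos * 3           # cells in the full result matrix
gib_chr  <- n_cells * 8 / 2^30     # character matrix: 8 bytes per cell (assumed)
gib_int  <- n_cells * 4 / 2^30     # same shape stored as integer indices into x

print(c(rows = n_combos,
        GiB_character = round(gib_chr, 1),
        GiB_integer   = round(gib_int, 1)))
```

The figure for the character matrix matches the 9.0 Gb in the error. The allocation fails inside combn() itself, so where the result is assigned afterwards makes no difference.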

I thought of using the bigmemory package to create a file-backed big.matrix so that I can then assign the results of the combn() function to it. I can create a preallocated big matrix:

library(bigmemory)
x <- as.character(1:1344)
combos <- 403716544
test <- filebacked.big.matrix(nrow = combos, ncol = 3, 
        init = 0, backingfile = "test.matrix")

But when I try to assign the values with test <- combn(x, 3), I still get the same error: cannot allocate vector of size 9.0 Gb

I even tried coercing the result of combn(x, 3), but I think that because the combn() call itself fails, the as.big.matrix() call never gets a result to work with.

test <- as.big.matrix(matrix(combn(x, 3)), backingfile = "abc")
Error: cannot allocate vector of size 9.0 Gb
Error in as.big.matrix(matrix(combn(x, 3)), backingfile = "abc") : 
  error in evaluating the argument 'x' in selecting a method for function 'as.big.matrix'

Is there a way to combine these two functions together to get what I need? Are there any other ways of achieving this? Thanks.

Comments (3)

无力看清 2024-10-15 04:53:12

Here's a function I've written in R, which currently finds its (unexported) home in the LSPM package. You give it the total number of items n, the number of items to select r, and the index of the combination you want i; it returns the values in 1:n corresponding to combination i.

".combinadic" <- function(n, r, i) {

  # http://msdn.microsoft.com/en-us/library/aa289166(VS.71).aspx
  # http://en.wikipedia.org/wiki/Combinadic

  if(i < 1 || i > choose(n,r)) stop("'i' must satisfy 1 <= i <= choose(n, r)")

  largestV <- function(n, r, i) {
    #v <- n-1
    v <- n                                  # Adjusted for one-based indexing
    #while(choose(v,r) > i) v <- v-1
    while(choose(v,r) >= i) v <- v-1        # Adjusted for one-based indexing
    return(v)
  }

  res <- rep(NA,r)
  for(j in 1:r) {
    res[j] <- largestV(n,r,i)
    i <- i-choose(res[j],r)
    n <- res[j]
    r <- r-1
  }
  res <- res + 1
  return(res)
}

It allows you to generate each combination based on the value of the lexicographic index:

> .combinadic(1344, 3, 1)
[1] 3 2 1
> .combinadic(1344, 3, 2)
[1] 4 2 1
> .combinadic(1344, 3, 403716544)
[1] 1344 1343 1342

So you just need to loop over 1:403716544 and append the results to a file. It may take a while, but it's at least feasible (see Dirk's answer). You may also need to do it in several loops, since the vector 1:403716544 will not fit in memory on my machine.

Or you could just port the R code to C/C++ and do the looping / writing there, since it would be a lot faster.
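
The chunked loop described above could look roughly like the following. This is a minimal sketch, not tested at full scale: unrank_combination is a hypothetical compact restatement of .combinadic so that the block runs on its own, and write_combinations and chunk_size are names made up for illustration.

```r
# Hypothetical helper: same logic as .combinadic above. Returns the i-th
# r-combination (1-based) of 1:n, largest element first.
unrank_combination <- function(n, r, i) {
  stopifnot(i >= 1, i <= choose(n, r))
  res <- numeric(r)
  for (j in seq_len(r)) {
    v <- n
    while (choose(v, r) >= i) v <- v - 1   # largest v with choose(v, r) < i
    res[j] <- v
    i <- i - choose(v, r)
    n <- v
    r <- r - 1
  }
  as.numeric(res + 1)
}

# Write all 3-combinations of x to a CSV file, chunk_size rows at a time,
# so that only one chunk is ever held in memory.
write_combinations <- function(x, path, chunk_size = 10000) {
  n <- length(x)
  total <- choose(n, 3)
  start <- 1
  while (start <= total) {
    end <- min(start + chunk_size - 1, total)
    # 3 x k matrix of indices, transposed to k x 3
    idx <- t(vapply(seq(start, end),
                    function(i) unrank_combination(n, 3, i),
                    numeric(3)))
    out <- matrix(x[idx], ncol = 3)
    write.table(out, file = path, sep = ",", append = start > 1,
                row.names = FALSE, col.names = FALSE, quote = FALSE)
    start <- end + 1
  }
  invisible(total)
}

# Small demonstration on five letters (10 combinations, written in 3 chunks)
path <- tempfile(fileext = ".csv")
write_combinations(letters[1:5], path, chunk_size = 4)
result <- read.csv(path, header = FALSE, colClasses = "character")
print(result)
```

Looping by start and end positions avoids ever materialising the full index vector. For the real x, the pure-R unranking will still be slow, which is where a C/C++ port would pay off.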

痴梦一场 2024-10-15 04:53:12

You could first find all 2-way combinations, and then combine them with the third value, writing each batch to file as you go. This takes a lot less memory:

combn.mod <- function(x, fname){
  # All pairs, kept as a list so they can be filtered
  tmp <- combn(x, 2, simplify = FALSE)
  n <- length(x)
  for (i in x[-c(n, n - 1)]){
    # Drop all combinations that contain value i
    id <- which(!unlist(lapply(tmp, function(t) i %in% t)))
    tmp <- tmp[id]
    # Add i to all other combinations and write to file
    out <- do.call(rbind, lapply(tmp, c, i))
    write(t(out), file = fname, ncolumns = 3, append = TRUE, sep = ",")
  }
}

combn.mod(x,"F:/Tmp/Test.txt")

This is not as general as Joshua's answer, though; it is specifically for your case. I guess it is faster (again, for this particular case), but I didn't make the comparison. The function works on my computer using a little over 50 Mb (roughly estimated) when applied to your x.
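
A variant of the same idea, as a minimal sketch rather than a tuned implementation: pair each element only with the 2-way combinations of the elements that come after it, so nothing has to be filtered out. The name write_triples is made up for illustration, and the rows come out in a different order than with combn.mod.

```r
# For each x[i], write x[i] together with every pair drawn from the
# elements after it. Memory use is bounded by the largest pair matrix.
write_triples <- function(x, path) {
  n <- length(x)
  stopifnot(n >= 3)
  con <- file(path, open = "w")
  on.exit(close(con))
  for (i in seq_len(n - 2)) {
    pairs <- combn(x[(i + 1):n], 2)      # 2 x choose(n - i, 2) matrix
    writeLines(paste(x[i], pairs[1, ], pairs[2, ], sep = ","), con)
  }
  invisible(choose(n, 3))
}

# Small demonstration on five letters
path <- tempfile(fileext = ".csv")
write_triples(letters[1:5], path)
lines <- readLines(path)
print(lines)
```

For 1,344 strings the largest intermediate is the pair matrix for the first element, with choose(1343, 2) = 901,153 columns, which is small compared with the full result.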

EDIT

On a side note: if this is for simulation purposes, I find it hard to believe that any scientific application needs 400+ million simulation runs. You might be looking for the right answer to the wrong question here...

PROOF OF CONCEPT:

I replaced the write line with tt[[i]] <- out, added tt <- list() before the loop and return(tt) after it. Then:

> do.call(rbind,combn.mod(letters[1:5]))
      [,1] [,2] [,3]
 [1,] "b"  "c"  "a" 
 [2,] "b"  "d"  "a" 
 [3,] "b"  "e"  "a" 
 [4,] "c"  "d"  "a" 
 [5,] "c"  "e"  "a" 
 [6,] "d"  "e"  "a" 
 [7,] "c"  "d"  "b" 
 [8,] "c"  "e"  "b" 
 [9,] "d"  "e"  "b" 
[10,] "d"  "e"  "c" 

坐在坟头思考人生 2024-10-15 04:53:12

At a first approximation, every algorithm trades off storage for speed.

You have hit a boundary trying to preallocate your fully enumerated combination matrix. So maybe you should not preallocate this matrix at all and instead try, say,

  1. If you think you need the combinations, calculate them somewhere else, store them in a simple database (or, heck, a flat file) and look them up -- 9 GB saved

  2. Take advantage of open source, read the code to combn() and modify it into a client-server thingy: given a call with index number N, it will loop and return the Nth entry. Not efficient, but possibly more easily feasible.
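
The second option can be sketched in plain R as follows. This is not combn()'s actual source, only an illustration of the idea: next_combination and nth_combination are hypothetical names, and the successor step is the standard "increment the rightmost index that still has room" rule, which visits combinations in the same column order that combn() produces.

```r
# Given the current combination idx (increasing indices into 1:n),
# return the next one in lexicographic order, or NULL after the last.
next_combination <- function(idx, n) {
  r <- length(idx)
  j <- r
  # Find the rightmost position that has not reached its maximum value
  while (j >= 1 && idx[j] == n - r + j) j <- j - 1
  if (j < 1) return(NULL)
  idx[j] <- idx[j] + 1
  if (j < r) idx[(j + 1):r] <- idx[j] + seq_len(r - j)
  idx
}

# Loop from the first combination up to the N-th one and return it.
# Only the current combination is ever held in memory.
nth_combination <- function(n, r, N) {
  idx <- seq_len(r)
  for (k in seq_len(N - 1)) {
    idx <- next_combination(idx, n)
    if (is.null(idx)) stop("N exceeds choose(n, r)")
  }
  idx
}

fourth <- nth_combination(5, 3, 4)
last   <- nth_combination(5, 3, 10)
print(rbind(fourth, last))
```

As the answer says, looping to the N-th entry on every call is not efficient. A caller that wants every combination would instead keep idx between calls and apply next_combination repeatedly, writing each result out as it goes.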
