Converting an R code snippet to use the Matrix package?

Posted 2024-08-20 07:31:46

I am not sure there are any R users out there, but just in case:

I am a novice at R and was kindly "handed down" the following R code snippet:

Beta <- exp(as.matrix(read.table('beta.transpose')))
WordFreq <- read.table('freq-matrix')
WordProbs <- WordFreq$V1 / sum(WordFreq)

infile <- file('freq-matrix')
outfile <- file('doc_topic_prob_matrix', 'w')

open(infile)
open(outfile)

for (i in 1:93049) {
  vec <- t(scan(infile, nlines=1))
  topics <- (vec/WordProbs) %*% Beta
  write.table(topics, outfile, append=T, row.names=F, col.names=F)
  }

When I tried running this on my dataset, the system thrashed and swapped like crazy. Now I realize that has a simple reason: the file freq-matrix holds a large (22GB) matrix and I was trying to read it into memory.

I have been told to use the Matrix package, because freq-matrix has many, many zeros all over the place and it handles such cases well. Will that help? If so, any hints on how to change this code would be most welcome. I have no R experience and just started reading through the introduction PDF available on the site.

Many thanks

~l

Comments (2)

冷默言语 2024-08-27 07:31:46

My suggestion might be completely off, because you don't give enough details about the contents of your files, and I had to guess from the code. Anyway, here it goes.

You don't state it, but I would assume that your code crashes on the second line, when you read in the big matrix with read.table. The loop reads the lines one at a time and should not crash. The only reason you need that big matrix in memory is to calculate the WordProbs vector. So why not rewrite that part using the same kind of loop with scan? In fact, you probably don't even need to store the WordProbs vector, just sum(WordFreq), which you can get from an initial pass through the file. Then rewrite the formula within the loop to calculate the current WordProb.
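A minimal sketch of that initial pass, in case it helps. It assumes freq-matrix is whitespace-separated numbers with one matrix row per line (the question does not say). The toy file, the helper name stream_word_probs, and the n_rows argument are made up for illustration. It accumulates the grand total and the first-column entries line by line, so it reproduces WordFreq$V1 / sum(WordFreq) without ever holding the whole matrix:

```r
# Toy stand-in for 'freq-matrix' (assumed layout: one matrix row per line).
path <- tempfile()
writeLines(c("1 0 2", "0 3 0", "4 0 0"), path)

# Hypothetical helper: one streaming pass over the file.
stream_word_probs <- function(path, n_rows) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  first_col <- numeric(n_rows)
  for (i in seq_len(n_rows)) {
    vec <- scan(con, nlines = 1, quiet = TRUE)  # a single row at a time
    total <- total + sum(vec)                   # running sum(WordFreq)
    first_col[i] <- vec[1]                      # running WordFreq$V1
  }
  first_col / total
}

word_probs <- stream_word_probs(path, n_rows = 3)

# Sanity check against the original in-memory formula (fine on toy data).
wf <- read.table(path)
stopifnot(isTRUE(all.equal(word_probs, wf$V1 / sum(wf))))
```

On the real file you would pass n_rows = 93049, as in the question's loop. Memory use is then one row plus a vector of 93049 numbers, and the existing second loop can stay as it is.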

最佳男配角 2024-08-27 07:31:46

Belated answer, but I'd recommend reading the data into a memory-mapped file, using the bigmemory package. After that, I'd look for the non-zero entries, which can then be represented as a 3-column matrix: (ix_row, ix_col, value). This is usually called coordinate list (COO) format, though the name is unimportant. From there, Matrix supports the creation of sparse matrices (via sparseMatrix). Once you have the COO, you're pretty much set: conversion to and from the sparse matrix format is reasonably fast, and multiplying the sparse matrix by Beta should be reasonably fast too. If you need even greater speed, you could use an optimized BLAS library, but that opens up more questions. :)
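A rough sketch of the COO-to-sparse step, under some assumptions. To stay self-contained it skips bigmemory and instead collects the non-zero entries by reading the file one line at a time with scan, which is a swapped-in approach, not the memory-mapped one described above. The toy file and the Beta values are invented, and the file layout (one matrix row per line) is a guess:

```r
library(Matrix)

# Toy stand-in for 'freq-matrix' (assumed layout: one matrix row per line).
path <- tempfile()
writeLines(c("1 0 2", "0 3 0", "4 0 0"), path)
n_rows <- 3

# Collect the non-zero entries of each row as (ix_row, ix_col, value) triplets.
con <- file(path, open = "r")
rows <- vector("list", n_rows)
n_cols <- 0
for (r in seq_len(n_rows)) {
  vec <- scan(con, nlines = 1, quiet = TRUE)
  n_cols <- max(n_cols, length(vec))
  nz <- which(vec != 0)
  rows[[r]] <- cbind(ix_row = rep(r, length(nz)), ix_col = nz, value = vec[nz])
}
close(con)
coo <- do.call(rbind, rows)

# Build the sparse matrix from the COO triplets.
freq <- sparseMatrix(i = coo[, "ix_row"], j = coo[, "ix_col"],
                     x = coo[, "value"], dims = c(n_rows, n_cols))

# Hypothetical Beta with one row per column of freq.
Beta <- matrix(c(1, 2, 3, 4, 5, 6), nrow = n_cols)
topics <- freq %*% Beta
print(as.matrix(topics))
```

This only shows the sparse product with Beta. The division by WordProbs from the question would still need to be applied, and whether the real data is sparse enough to fit in memory this way depends on how many entries are non-zero.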
