Converting an R code snippet to use the Matrix package?

Posted on 2024-08-20 07:31:46 · 857 characters · 5 views · 0 comments


I am not sure there are any R users out there, but just in case:

I am a novice at R and was kindly "handed down" the following R code snippet:

Beta <- exp(as.matrix(read.table('beta.transpose')))
WordFreq <- read.table('freq-matrix')
WordProbs <- WordFreq$V1 / sum(WordFreq)

infile <- file('freq-matrix')
outfile <- file('doc_topic_prob_matrix', 'w')

open(infile)
open(outfile)

for (i in 1:93049) {
  vec <- t(scan(infile, nlines=1))
  topics <- (vec/WordProbs) %*% Beta
  write.table(topics, outfile, append=TRUE, row.names=FALSE, col.names=FALSE)
}

When I tried running this on my dataset, the system thrashed and swapped like crazy. Now I realize that has a simple reason: the file freq-matrix holds a large (22GB) matrix and I was trying to read it into memory.

I have been told to use the Matrix package, because freq-matrix has many, many zeros all over the place and it handles such cases well. Will that help? If so, any hints on how to change this code would be most welcome. I have no R experience and just started reading through the introduction PDF available on the site.

Many thanks


冷默言语 2024-08-27 07:31:46

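My suggestion may be completely off, because you did not give enough detail about the contents of your files and I had to guess from the code. Anyway, here it goes.

You didn't say so, but I assume your code crashes on the second line, where you read in the big matrix. The loop reads one line at a time and should not crash. The only reason the big matrix is needed is to compute the WordProbs vector. So why not rewrite that part with the same kind of loop, using scan? In fact, you probably don't even need to store the WordProbs vector, only sum(WordFreq), which you can get from an initial pass over the file. Then rewrite the formula inside the loop to compute the current WordProb.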
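A minimal sketch of that two-pass idea, under stated assumptions: the file is whitespace-separated numbers with one matrix row per line, and the "current WordProb" for row i is taken to be the first entry of that row divided by the grand total (one reading of WordFreq$V1 / sum(WordFreq), which may not be what the original code intended). The toy file, the toy Beta, and the names freq_path, out_path, total, n_rows and word_prob are made up for illustration.

```r
# Toy stand-in for 'freq-matrix' (hypothetical data: 3 rows x 4 columns).
freq_path <- tempfile()
writeLines(c("2 0 1 0", "1 3 0 0", "4 0 0 2"), freq_path)

# Toy stand-in for exp(as.matrix(read.table('beta.transpose'))): 4 x 2.
Beta <- matrix(c(0.1, 0.2, 0.3, 0.4,
                 0.4, 0.3, 0.2, 0.1), nrow = 4)

# Pass 1: accumulate the grand total one line at a time,
# so the full matrix is never held in memory.
total <- 0
n_rows <- 0
infile <- file(freq_path, "r")
repeat {
  vec <- scan(infile, nlines = 1, quiet = TRUE)
  if (length(vec) == 0) break   # nothing left to read
  total <- total + sum(vec)
  n_rows <- n_rows + 1
}
close(infile)

# Pass 2: recompute the per-row normaliser on the fly and write the result.
out_path <- tempfile()
infile <- file(freq_path, "r")
outfile <- file(out_path, "w")
for (i in seq_len(n_rows)) {
  vec <- scan(infile, nlines = 1, quiet = TRUE)
  word_prob <- vec[1] / total   # assumed meaning of the "current WordProb"
  topics <- (t(vec) / word_prob) %*% Beta
  write.table(topics, outfile, row.names = FALSE, col.names = FALSE)
}
close(infile)
close(outfile)
```

Only one row and the running total are in memory at any time. If the first entry of a row can be zero in the real data, word_prob would be zero and the division would produce Inf, so that case would need separate handling.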


最佳男配角 2024-08-27 07:31:46

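A late answer, but I would suggest reading the data into a memory-mapped file with the bigmemory package. After that, I would look for the non-zero entries and represent them as a 3-column matrix: (ix_row, ix_col, value). This is called a coordinate list (COO), although the name isn't important. From there, Matrix supports creating sparse matrices (via sparseMatrix). Once you have the COO, you are all set: conversion to and from the sparse matrix formats is quite fast. Multiplying the matrix by Beta should be fairly fast. If you need more speed, you can use an optimized BLAS library, but that opens up more questions. :)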
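The COO-to-sparse step could look roughly like the sketch below. It leaves out the bigmemory part entirely and uses a small in-memory matrix as a stand-in for the memory-mapped data. The toy values and the names dense, nz, coo and freq_sparse are illustrative assumptions; only which, sparseMatrix and %*% are real calls.

```r
library(Matrix)

# Toy stand-in for the big frequency matrix (hypothetical data: 3 x 4).
dense <- matrix(c(2, 0, 1, 0,
                  1, 3, 0, 0,
                  4, 0, 0, 2), nrow = 3, byrow = TRUE)

# Find the non-zero entries and store them as a 3-column COO matrix.
nz <- which(dense != 0, arr.ind = TRUE)
coo <- cbind(ix_row = nz[, "row"],
             ix_col = nz[, "col"],
             value  = dense[nz])

# Build the sparse matrix from the (row, column, value) triplets.
freq_sparse <- sparseMatrix(i = coo[, "ix_row"],
                            j = coo[, "ix_col"],
                            x = coo[, "value"],
                            dims = dim(dense))

# Toy stand-in for Beta (4 x 2); the sparse-times-dense product
# only touches the stored non-zero entries.
Beta <- matrix(c(0.1, 0.2, 0.3, 0.4,
                 0.4, 0.3, 0.2, 0.1), nrow = 4)
topics <- freq_sparse %*% Beta
```

For the real data the triplets would have to be collected chunk by chunk rather than from a dense matrix held in memory, since the dense form is exactly what does not fit.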