Converting an R code snippet to use the Matrix package?
I am not sure there are any R users out there, but just in case:
I am a novice at R and was kindly "handed down" the following R code snippet:
Beta <- exp(as.matrix(read.table('beta.transpose')))
WordFreq <- read.table('freq-matrix')
WordProbs <- WordFreq$V1 / sum(WordFreq)
infile <- file('freq-matrix')
outfile <- file('doc_topic_prob_matrix', 'w')
open(infile)
open(outfile)
for (i in 1:93049) {
  vec <- t(scan(infile, nlines=1))
  topics <- (vec/WordProbs) %*% Beta
  write.table(topics, outfile, append=T, row.names=F, col.names=F)
}
When I tried running this on my dataset, the system thrashed and swapped like crazy. Now I realize that has a simple reason: the file freq-matrix holds a large (22GB) matrix and I was trying to read it into memory.
I have been told to use the Matrix package, because freq-matrix has many, many zeros all over the place and it handles such cases well. Will that help? If so, any hints on how to change this code would be most welcome. I have no R experience and just started reading through the introduction PDF available on the site.
Many thanks
~l
Comments (2)
My suggestion might be completely off, because you don't give enough details about the contents of your files, and I had to guess from the code. Anyway, here it goes.
You don't state it, but I would assume that your code crashes on the second line, when you read in the big matrix. The loop reads the lines one at a time and should not crash. The only reason you need that big matrix is to calculate the WordProbs vector. So why not rewrite that part with the same kind of loop, using scan? In fact, you probably don't even need to store the WordProbs vector, just sum(WordFreq) - you can get that with an initial pass through the file. Then rewrite the formula within the loop to calculate the current WordProb.
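A minimal sketch of that initial pass, assuming freq-matrix is a whitespace-separated text file with one matrix row per line (as the scan call in the question suggests). The toy file and its values are made up for the demo. Only the first value of each line and a running total are kept, which is all that WordFreq$V1 / sum(WordFreq) needs.

```r
# Toy stand-in for 'freq-matrix': 3 documents x 4 words.
# (The file contents are invented for this demo.)
path <- tempfile("freq-matrix-")
writeLines(c("2 0 1 0",
             "1 3 0 0",
             "4 0 0 2"), path)

# Pass 1: stream the file one line at a time, keeping only the first value
# of each line (the V1 column) and a running total of all entries.
first_col <- numeric(0)
total <- 0
con <- file(path, open = "r")
repeat {
  vec <- scan(con, nlines = 1, quiet = TRUE)
  if (length(vec) == 0) break   # scan() returns an empty vector at end of file
  first_col <- c(first_col, vec[1])
  total <- total + sum(vec)
}
close(con)

# Same value as WordFreq$V1 / sum(WordFreq), without loading the whole matrix.
WordProbs <- first_col / total
```

With WordProbs computed this way, the read.table line can be dropped and the original loop left as it is. For a file with 93049 rows, preallocating first_col would avoid the repeated copying done by c().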
Belated answer, but I'd recommend reading the data into a memory-mapped file, using the bigmemory package. After that, I'd look for the non-zero entries, which can then be represented as a 3-column matrix: (ix_row, ix_col, value). This is called a coordinate list (COO), though the name is unimportant. From there, Matrix supports the creation of sparse matrices (via sparseMatrix). After you get the COO, you're pretty much set - conversion to and from the sparse matrix format is reasonably fast. Multiplying the matrix by Beta should be reasonably fast. If you need even greater speed, you could use an optimized BLAS library, but that opens up more questions. :)
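A rough sketch of the COO-to-sparse step, under some assumptions: it skips the bigmemory stage and instead collects the non-zero triplets by streaming a whitespace-separated text file line by line. The toy file and the small Beta matrix are invented for the demo, and the division by WordProbs from the question is left out to keep the focus on building the sparse matrix.

```r
library(Matrix)

# Toy stand-in for 'freq-matrix' (3 documents x 4 words); contents are made up.
path <- tempfile("freq-matrix-")
writeLines(c("2 0 1 0",
             "1 3 0 0",
             "4 0 0 2"), path)

# Collect the non-zero entries as coordinate triplets (row, column, value).
rows <- list(); cols <- list(); vals <- list()
n_cols <- 0
i <- 0
con <- file(path, open = "r")
repeat {
  vec <- scan(con, nlines = 1, quiet = TRUE)
  if (length(vec) == 0) break          # empty result means end of file
  i <- i + 1
  n_cols <- max(n_cols, length(vec))
  nz <- which(vec != 0)                # column indices of non-zero entries
  rows[[i]] <- rep(i, length(nz))
  cols[[i]] <- nz
  vals[[i]] <- vec[nz]
}
close(con)

# Build the sparse matrix from the triplets.
Freq <- sparseMatrix(i = unlist(rows), j = unlist(cols), x = unlist(vals),
                     dims = c(i, n_cols))

# Hypothetical 4 words x 2 topics matrix standing in for Beta.
Beta <- matrix(c(0.5, 0.1, 0.2, 0.3,
                 0.5, 0.9, 0.8, 0.7), nrow = 4)

topics <- Freq %*% Beta                # sparse-by-dense product
```

Only the non-zero entries are ever held in memory, so the footprint depends on how sparse the data really is. as.matrix(topics) converts the result back to an ordinary base R matrix if it needs to go through write.table.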