tm package error: cannot convert a DocumentTermMatrix to a regular matrix because the vector is too large
I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). To perform further calculations on this matrix I need to convert it to a regular matrix, which I want to do with the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.
> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1] 1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes
For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?
Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
5 Answers
The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix. The dtm object has i, j and v attributes, which are the internal triplet representation of your DocumentTermMatrix. Use something along these lines (a sketch, assuming the DTM from the question is stored in mat):
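library(Matrix)
# Build a sparse dgCMatrix from the DTM's internal triplet slots.
# A sketch: `mat` is the DocumentTermMatrix from the question; the dims
# and dimnames arguments simply mirror its nrow/ncol/dimnames slots.
mat2 <- sparseMatrix(i = mat$i, j = mat$j, x = mat$v,
                     dims = c(mat$nrow, mat$ncol),
                     dimnames = mat$dimnames)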
and you're done.
A naive comparison between your objects, for example looking at the same small corner of each:
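# A spot-check (the 10x10 corner is just an example); both are densified
# so that they print in the same form.
as.matrix(mat[1:10, 1:10])
as.matrix(mat2[1:10, 1:10])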
will each give you the exact same output.
DocumentTermMatrix uses a sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides some linear algebra routines for sparse matrices.
Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes
Also, sometimes when working with big objects in R, I occasionally call gc() to free up wasted memory, for example:
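# Report memory use and trigger garbage collection
gc()
# On Windows only, the allocation limit can be inspected or raised
# (size in MB; 4000 is just an example value):
# memory.limit()
# memory.limit(size = 4000)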
The number of documents should not be a problem, but you may want to try removing sparse terms; this could very well reduce the dimensions of the document-term matrix by dropping terms that have a sparsity of at least 0.7 (see the sketch below). Another option available to you is to specify a minimum word length and a minimum document frequency when you create the document-term matrix. Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin the significant relations hidden in your docs and terms.
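library(tm)
# A sketch, assuming the question's DTM is in `mat` and the corpus in `corp`.
# Drop terms with a sparsity of at least 0.7:
mat_small <- removeSparseTerms(mat, 0.7)

# Alternatively, constrain terms when the DTM is built; the minimum word
# length of 4 and the minimum document frequency of 5 are example values.
mat_ctrl <- DocumentTermMatrix(corp, control = list(
  wordLengths = c(4, Inf),
  bounds      = list(global = c(5, Inf))
))

inspect(mat)        # before
inspect(mat_small)  # after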
Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
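For the cosine case, a sketch along these lines (assuming the question's DTM is in mat; the cross-product helpers from slam do the heavy lifting instead of crossapply, and the Jaccard case would need its own numerator and denominator):
library(slam)
# Document-document cosine similarity computed directly on the sparse DTM
# (a simple_triplet_matrix), without densifying the 1859 x 25722 matrix first.
# Note: documents with no terms have a zero norm and would yield NaN rows.
num     <- tcrossprod_simple_triplet_matrix(mat)   # raw dot products, 1859 x 1859
norms   <- sqrt(row_sums(mat^2))                   # Euclidean norm of each document
cos_sim <- num / (norms %*% t(norms))              # dense 1859 x 1859 similarity matrix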