tm package error: "cannot convert DocumentTermMatrix into normal matrix since vector is too large"

Posted on 2024-12-03 21:59:12


I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.

> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1]  1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes

For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?

Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
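The size blow-up itself is expected: a DocumentTermMatrix is stored as a sparse simple_triplet_matrix (only the non-zero cells), while a base-R matrix stores every cell as an 8-byte double, zeros included. The failed allocation can be predicted exactly from the dimensions reported above:

```r
# 1859 documents x 25722 terms, 8 bytes per dense double cell
dense_mb <- 1859 * 25722 * 8 / 2^20
dense_mb  # about 364.8 -- exactly the vector R fails to allocate
```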


Comments (5)

草莓酥 2024-12-10 21:59:12


The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.

> attributes(dtm)
$names
[1] "i"        "j"        "v"        "nrow"     "ncol"     "dimnames"

$class
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

$Weighting
[1] "term frequency" "tf"            

The dtm object has the i, j and v attributes, which are the internal representation of your DocumentTermMatrix. Use:

library("Matrix") 
mat <- sparseMatrix(
           i=dtm$i,
           j=dtm$j, 
           x=dtm$v,
           dims=c(dtm$nrow, dtm$ncol)
           )

and you're done.

A naive comparison between your objects:

> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)

will each give you the exact same output.
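Once the data is in a Matrix sparse matrix, most ordinary matrix operations work on it directly, without ever densifying. A minimal sketch (the toy triplet data here is hypothetical, standing in for the converted DTM):

```r
library(Matrix)

# Hypothetical stand-in for the converted DTM: 3 documents x 4 terms
mat <- sparseMatrix(i = c(1, 1, 2, 3), j = c(1, 3, 2, 4),
                    x = c(2, 1, 5, 3), dims = c(3, 4))

colSums(mat)    # per-term frequencies, computed on the sparse form
mat %*% t(mat)  # 3 x 3 document-document product; the result stays sparse
```

colSums(), rowSums(), %*% and crossprod() all have sparse methods in Matrix, so the 364.8 MB dense allocation is never needed.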

热血少△年 2024-12-10 21:59:12


DocumentTermMatrix uses a sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides some linear algebra routines that work with sparse matrices.

病女 2024-12-10 21:59:12


Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes

Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.

病毒体 2024-12-10 21:59:12


The number of documents should not be a problem, but you may want to try removing sparse terms; this can very well reduce the dimension of the document-term matrix.

inspect(removeSparseTerms(dtm, 0.7))

This removes terms whose sparsity is greater than 0.7.

Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:

a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))

Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin the significant relations hidden in your documents and terms.

若言繁花未落 2024-12-10 21:59:12


Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
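A sketch of that idea for Jaccard similarity, assuming a binarized (0/1) DTM; the toy triplet data below is hypothetical. slam's tcrossprod_simple_triplet_matrix produces the pairwise shared-term counts without densifying the DTM first:

```r
library(slam)

# Hypothetical binary DTM: doc1 = {t1, t2}, doc2 = {t2, t3}, doc3 = {t4}
dtm <- simple_triplet_matrix(i = c(1, 1, 2, 2, 3),
                             j = c(1, 2, 2, 3, 4),
                             v = rep(1, 5), nrow = 3, ncol = 4)

inter <- tcrossprod_simple_triplet_matrix(dtm)  # shared-term counts (dense docs x docs)
n     <- row_sums(dtm)                          # number of terms per document
jacc  <- inter / (outer(n, n, "+") - inter)     # intersection over union
```

The resulting docs-by-docs matrix is only 1859 x 1859 here, which is tiny compared with the dense version of the DTM itself.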
