Converting a document-term matrix into a matrix with lots of data causes overflow

Posted on 2024-11-26 16:53:33

Let's do some text mining.

Here I have a term-document matrix (from the tm package):

dtm <- TermDocumentMatrix(
     myCorpus,
     control = list(
         weight = weightTfIdf,
         tolower=TRUE,
         removeNumbers = TRUE,
         minWordLength = 2,
         removePunctuation = TRUE,
         stopwords=stopwords("german")
      ))

When I run

typeof(dtm)

I see that it is a "list", and the structure looks like this:

Docs
Terms        1 2 ...
  lorem      0 0 ...
  ipsum      0 0 ...
  ...        .......

So I tried:

wordMatrix = as.data.frame( t(as.matrix(  dtm )) )

That works for 1000 documents.

But when I try it with 40000 documents, it no longer works.

I get this error:

Fehler in vector(typeof(x$v), nr * nc) : Vektorgröße kann nicht NA sein
Zusätzlich: Warnmeldung:
In nr * nc : NAs durch Ganzzahlüberlauf erzeugt

In English:

Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

So I looked at as.matrix, and it turns out that the function somehow converts the object to a vector with as.vector and then to a matrix.
The conversion to a vector works, but the conversion from the vector to the matrix doesn't.

Do you have any suggestions as to what the problem could be?

Thanks, The Captain

你的呼吸 2024-12-03 16:53:33

The integer overflow tells you exactly what the problem is: with 40000 documents, you have too much data. The problem begins in the conversion to a matrix, by the way, which you can see if you look at the code of the underlying function:

class(dtm)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

getAnywhere(as.matrix.simple_triplet_matrix)

A single object matching ‘as.matrix.simple_triplet_matrix’ was found
...
function (x, ...) 
{
    nr <- x$nrow
    nc <- x$ncol
    y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
   ...
}

This is the line referenced by the error message. What is going on can easily be simulated with:

as.integer(40000 * 60000) # 40000 documents is 40000 rows in the resulting frame
[1] NA
Warning message:
NAs introduced by coercion

The function vector() takes a length argument, in this case nr * nc. Because nr and nc are integers, a product larger than approx. 2.1e9 (.Machine$integer.max) overflows and is replaced by NA. This NA is not valid as an argument to vector().
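The overflow can be reproduced without tm at all. The following is a minimal sketch; the dimensions are assumptions (40000 documents from the question, 60000 terms as in the simulation above), stored as integers the way x$nrow and x$ncol are.

```r
# Dimensions stored as integers (assumed values, for illustration only)
nr <- 40000L
nc <- 60000L

# integer * integer beyond .Machine$integer.max yields NA, with a warning
n_int <- suppressWarnings(nr * nc)
is.na(n_int)

# Promoting one operand to double avoids the overflow
n_dbl <- as.numeric(nr) * nc
n_dbl > .Machine$integer.max
```

The second form stays a valid number because double arithmetic is not limited to the integer range.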

Bottom line: you're running into the limits of R. As of now, working in 64-bit won't help you. You'll have to resort to different methods. One possibility is to keep working with the list you have (dtm is a list), select the data you need using list manipulation, and go from there.
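As a rough sketch of what working with the list directly could look like: the toy object below is a hypothetical stand-in that only mimics the i, j, v, nrow, ncol and dimnames fields of a simple_triplet_matrix, and the names stm, term_totals, keep_docs and small are made up for illustration. It uses base R only.

```r
# Toy stand-in for a term-document matrix in triplet form (hypothetical data)
stm <- list(
  i = c(1L, 1L, 2L, 3L),      # row (term) index of each non-zero entry
  j = c(1L, 3L, 2L, 3L),      # column (document) index of each non-zero entry
  v = c(0.5, 1.5, 2.0, 1.0),  # the non-zero values themselves
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("lorem", "ipsum", "dolor"),
                  Docs  = c("1", "2", "3"))
)

# Total weight per term, computed from the non-zero entries only
term_totals <- numeric(stm$nrow)
sums <- tapply(stm$v, stm$i, sum)
term_totals[as.integer(names(sums))] <- sums
names(term_totals) <- stm$dimnames$Terms

# Densify only a small slice: the columns for documents 1 and 3
keep_docs <- c(1L, 3L)
sel <- stm$j %in% keep_docs
small <- matrix(0, stm$nrow, length(keep_docs),
                dimnames = list(stm$dimnames$Terms,
                                stm$dimnames$Docs[keep_docs]))
small[cbind(stm$i[sel], match(stm$j[sel], keep_docs))] <- stm$v[sel]
```

Nothing here allocates nrow * ncol cells for the full object, so the integer overflow never comes into play. On a real dtm, the slam package that provides simple_triplet_matrix may offer ready-made helpers for such sums.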

PS: I made a dtm object with

require(tm)
data("crude")
dtm <- TermDocumentMatrix(crude,
                          control = list(weighting = weightTfIdf,
                                         stopwords = TRUE))

长伴 2024-12-03 16:53:33

Here is a very, very simple solution I discovered recently:

library(tm)         # t() method for the term-document matrix
library(bigmemory)  # provides as.big.matrix()

DTM <- t(TDM)                           # transpose the term-document matrix (optional; I prefer DTM over TDM)
M <- as.big.matrix(x = as.matrix(DTM))  # convert the DTM into a bigmemory object
M <- as.matrix(M)                       # convert the bigmemory object back to a regular matrix
M <- t(M)                               # transpose again to get the TDM layout

Please note that transposing the TDM to get a DTM is entirely optional; it's my personal preference to work with matrices this way.

P.S. I could not answer the question 4 years ago, as I had only just started college.

归属感 2024-12-03 16:53:33

Based on Joris Meys's answer, I've found the solution. The vector() documentation says this about the length argument:

...
For a long vector, i.e., length > .Machine$integer.max, it has to be of type "double"...

So we can make a tiny fix to as.matrix():

as.big.matrix <- function(x) {
  nr <- x$nrow
  nc <- x$ncol
  # nr and nc are integers. 1 is double. Double * integer -> double
  y <- matrix(vector(typeof(x$v), 1 * nr * nc), nr, nc)
  y[cbind(x$i, x$j)] <- x$v
  dimnames(y) <- x$dimnames
  y
}
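Even with the overflow out of the way, the dense result still has to fit in memory. A back-of-the-envelope estimate, assuming 40000 documents, 60000 terms (the figure used in the earlier simulation, not a number from the question) and double values at 8 bytes each:

```r
nr <- 40000   # documents (from the question)
nc <- 60000   # terms (assumed, as in the earlier simulation)

bytes <- nr * nc * 8   # one double takes 8 bytes
gib   <- bytes / 2^30  # convert bytes to GiB
round(gib, 1)
```

So a machine with less RAM than this estimate would likely fail at allocation instead, and the transpose and as.data.frame steps from the question would each need a further copy.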
