Performing SVD on tweets: memory problems
EDIT: The size of the word list is 10-20 times bigger than I wrote down. I simply forgot a zero.
EDIT2: I will have a look into SVDLIBC and also see how to reduce the matrix to its dense version, so that might help too.
I have generated a huge CSV file as the output of my POS tagging and stemming. It looks like this:
           word1  word2  word3  ...  word150000
person1        1      2      0  ...           1
person2        0      0      1  ...           0
...
person650
It contains the word counts for each person; this gives me a characteristic vector for each person.
I want to run an SVD on this beast, but it seems the matrix is too big to be held in memory for the operation. My question is:
Should I reduce the number of columns by removing words with a column sum of, for example, 1, meaning they have been used only once? Do I bias the data too much with this approach?
I tried the RapidMiner approach of loading the CSV into a database and then reading it back sequentially in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. And if I transpose the data and re-transpose it on import, it also takes ages...
--> So, in general, I am asking for advice on how to perform an SVD on such a corpus.
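A minimal sketch of one way to sidestep the memory problem, not part of the original pipeline: stream the CSV row by row and store only the non-zero counts in a SciPy sparse matrix, then optionally drop the singleton-count words asked about above. The file name counts.csv and the assumption that the header row lists the words while each data row starts with the person label are illustrative guesses.

import csv
import numpy as np
from scipy import sparse

rows, cols, vals, labels = [], [], [], []
with open("counts.csv", newline="") as f:
    reader = csv.reader(f)
    words = next(reader)                  # header row: the word list
    for i, line in enumerate(reader):
        labels.append(line[0])            # personN label in the first column
        for j, cell in enumerate(line[1:]):
            count = int(cell)
            if count != 0:                # keep only non-zero entries
                rows.append(i)
                cols.append(j)
                vals.append(count)

# persons x words matrix that never exists in dense form
counts = sparse.csr_matrix((vals, (rows, cols)), shape=(len(labels), len(words)))

# Optionally prune words used only once in the whole corpus
col_sums = np.asarray(counts.sum(axis=0)).ravel()
counts = counts[:, col_sums > 1]

With 650 people and a few hundred words per person, the sparse matrix holds only a few hundred thousand non-zeros, which fits in memory easily even if the vocabulary is in the millions.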
Answers (2)
This is a big dense matrix. However, it is only a small sparse matrix.
Using a sparse-matrix SVD algorithm is enough, e.g. here.
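A hedged sketch of what that suggestion could look like with scipy.sparse.linalg.svds (an illustration, not the answerer's code): svds works directly on a sparse matrix and computes only the top-k singular triplets, never forming the dense matrix. The random matrix and k=100 below are placeholders.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Placeholder sparse person x word matrix; in practice this would be the real counts
counts = sparse.random(650, 150_000, density=0.001, format="csr", random_state=0)

u, s, vt = svds(counts, k=100)           # u: 650x100, s: (100,), vt: 100x150000

# svds returns singular values in ascending order; reverse to the usual convention
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]

person_vectors = u * s                   # one 100-dimensional vector per person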
SVD is constrained by your memory size. See:
Folding-in: a paper on partial matrix updates.
Apache Mahout: a distributed data-mining library that runs on Hadoop and has a parallel SVD.
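For the folding-in suggestion, a small illustrative sketch (not from the paper itself, and using toy matrices as placeholders): once a truncated SVD U_k, S_k, V_k^T of the existing person x word matrix is available, a new person's raw count row d can be projected into the same k-dimensional space as d · V_k · S_k^{-1}, without recomputing the whole factorization.

import numpy as np

# Toy stand-in for the real person x word count matrix
counts = np.random.default_rng(0).integers(0, 3, size=(6, 20)).astype(float)

u, s, vt = np.linalg.svd(counts, full_matrices=False)
u, s, vt = u[:, :4], s[:4], vt[:4, :]    # keep a rank-4 approximation

new_counts = np.zeros(20)                # hypothetical new person's word counts
new_counts[[3, 7, 11]] = 1.0

new_vector = new_counts @ vt.T / s       # fold-in: d · V_k · S_k^{-1}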