Performing SVD on tweets: memory problems
EDIT: The size of the word list is 10-20 times bigger than I wrote down. I simply forgot a zero.
EDIT2: I will have a look into SVDLIBC and also see how to reduce the matrix to its dense version, so that might help too.
I have generated a huge CSV file as the output of my POS tagging and stemming. It looks like this:
           word1  word2  word3  ...  word150000
person1        1      2      0  ...           1
person2        0      0      1  ...           0
...
person650
It contains the word counts for each person; this gives me a characteristic vector for each person.
I want to run an SVD on this beast, but it seems the matrix is too big to be held in memory for the operation. My question is:
Should I reduce the number of columns by removing words with a column sum of, for example, 1, meaning they have been used only once? Do I bias the data too much with this approach?
I tried the RapidMiner approach of loading the CSV into a database and then reading it back sequentially in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. And if I transpose the data and re-transpose it on import, it also takes ages...
--> So, in general, I am asking for advice on how to perform an SVD on such a corpus.
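A minimal sketch of one way to sidestep the memory problem, not part of the original pipeline: stream the CSV row by row and store only the non-zero counts in a SciPy sparse matrix, then optionally drop the singleton-count words asked about above. The file name counts.csv and the assumption that the header row lists the words while each data row starts with the person label are illustrative guesses.

import csv
import numpy as np
from scipy import sparse

rows, cols, vals, labels = [], [], [], []
with open("counts.csv", newline="") as f:
    reader = csv.reader(f)
    words = next(reader)                  # header row: the word list
    for i, line in enumerate(reader):
        labels.append(line[0])            # personN label in the first column
        for j, cell in enumerate(line[1:]):
            count = int(cell)
            if count != 0:                # keep only non-zero entries
                rows.append(i)
                cols.append(j)
                vals.append(count)

# persons x words matrix that never exists in dense form
counts = sparse.csr_matrix((vals, (rows, cols)), shape=(len(labels), len(words)))

# Optionally prune words used only once in the whole corpus
col_sums = np.asarray(counts.sum(axis=0)).ravel()
counts = counts[:, col_sums > 1]

With 650 people and a few hundred words per person, the sparse matrix holds only a few hundred thousand non-zeros, which fits in memory easily even if the vocabulary is in the millions.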
Answers (2)
This is a big dense matrix. However, it is only a small sparse matrix.
Using a sparse-matrix SVD algorithm is enough, e.g. here.
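A hedged sketch of what that suggestion could look like with scipy.sparse.linalg.svds (an illustration, not the answerer's code): svds works directly on a sparse matrix and computes only the top-k singular triplets, never forming the dense matrix. The random matrix and k=100 below are placeholders.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Placeholder sparse person x word matrix; in practice this would be the real counts
counts = sparse.random(650, 150_000, density=0.001, format="csr", random_state=0)

u, s, vt = svds(counts, k=100)           # u: 650x100, s: (100,), vt: 100x150000

# svds returns singular values in ascending order; reverse to the usual convention
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]

person_vectors = u * s                   # one 100-dimensional vector per person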
SVD is constrained by your memory size. See:
Folding-in: a paper on partial matrix updates.
Apache Mahout: a distributed data-mining library that runs on Hadoop and has a parallel SVD.
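For the folding-in suggestion, a small illustrative sketch (not from the paper itself, and using toy matrices as placeholders): once a truncated SVD U_k, S_k, V_k^T of the existing person x word matrix is available, a new person's raw count row d can be projected into the same k-dimensional space as d · V_k · S_k^{-1}, without recomputing the whole factorization.

import numpy as np

# Toy stand-in for the real person x word count matrix
counts = np.random.default_rng(0).integers(0, 3, size=(6, 20)).astype(float)

u, s, vt = np.linalg.svd(counts, full_matrices=False)
u, s, vt = u[:, :4], s[:4], vt[:4, :]    # keep a rank-4 approximation

new_counts = np.zeros(20)                # hypothetical new person's word counts
new_counts[[3, 7, 11]] = 1.0

new_vector = new_counts @ vt.T / s       # fold-in: d · V_k · S_k^{-1}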