Performing SVD on tweets. Memory problem

Posted 2024-08-31 21:48:49

EDIT: The size of the word list is 10-20 times bigger than I wrote down. I simply forgot a zero.

EDIT2: I will look into SVDLIBC and also see how to reduce a matrix to its dense version, which might help too.

I have generated a huge csv file as an output from my pos tagging and stemming. It looks like this:

        word1, word2, word3, ..., word 150.000
person1   1      2      0            1
person2   0      0      1            0
...
person650

It contains the word counts for each person, which gives me a characteristic vector for each person.

I want to run an SVD on this beast, but the matrix seems too big to be held in memory for the operation. My questions are:

  • Should I reduce the number of columns by removing words whose column sum is, for example, 1, meaning they were used only once? Would I bias the data too much by doing this?

  • I tried the RapidMiner approach: loading the CSV into a database and then reading it back sequentially in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. If I transpose the data and then re-transpose it on import, it also takes ages.

--> So in general I am asking for advice on how to perform an SVD on such a corpus.
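
For reference, the column pruning asked about in the first bullet could be sketched as follows. This is only a minimal illustration in Python with SciPy, which the question does not mention; the toy matrix and the names `counts`, `column_sums`, and `pruned` are assumptions made up for the example. It shows the mechanics only and does not settle whether the pruning biases the data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the person-by-word count matrix (hypothetical data).
# Rows are persons, columns are words.
counts = csr_matrix(np.array([
    [1, 2, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [3, 0, 0, 0, 1],
]))

# Total count of each word across all persons, as a flat array.
column_sums = np.asarray(counts.sum(axis=0)).ravel()

# Indices of words used more than once in the whole corpus.
keep_idx = np.flatnonzero(column_sums > 1)

# Column selection keeps the result sparse.
pruned = counts[:, keep_idx]
```

Because the selection works on the sparse representation, the full dense table never has to be built.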

Comments (2)

瑕疵 2024-09-07 21:48:49

As a dense matrix this is big. Stored as a sparse matrix, however, it is small.

A sparse-matrix SVD algorithm is enough, e.g. here.
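
To make the point concrete: 650 × 1,500,000 entries stored densely as 8-byte floats would take roughly 7.8 GB, while a sparse format stores only the nonzero counts. The following is a minimal sketch of a truncated sparse SVD, assuming Python with SciPy's `scipy.sparse.linalg.svds` rather than whatever the link above pointed to. The matrix here is random stand-in data, and its size, density, and `k` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical stand-in for the real data: 650 persons x 5,000 words,
# with about 1% of the entries nonzero.
counts = sparse_random(650, 5000, density=0.01, format="csr",
                       random_state=0, dtype=np.float64)

# Compute only the k largest singular triplets instead of the full SVD.
k = 10
U, s, Vt = svds(counts, k=k)

# The order returned by svds is not guaranteed, so sort descending.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# Each row of U * s is a k-dimensional representation of one person.
person_vectors = U * s
```

Only the `k` leading singular vectors are computed, so memory use scales with the number of nonzeros and with `k`, not with the full dense shape.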

七秒鱼° 2024-09-07 21:48:49

SVD is constrained by your memory size. See:

Folding In: a paper on partial matrix updates.

Apache Mahout is a distributed data mining library that runs on Hadoop and has a parallel SVD.
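
As a rough illustration of the folding-in idea, here is a minimal sketch in Python with NumPy. It assumes the standard formulation from the latent semantic indexing literature, which may differ in detail from the paper referenced above: given a truncated SVD A ≈ U_k Σ_k V_kᵀ with persons as rows, a new row a is projected as a V_k Σ_k⁻¹ without recomputing the decomposition. The data and the helper `fold_in` are hypothetical.

```python
import numpy as np

# Hypothetical small count matrix: 6 persons x 12 words.
rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(6, 12)).astype(float)

# Truncated SVD of the data already seen, keeping k components.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

def fold_in(new_row, Vt_k, s_k):
    """Project a new person's word counts into the existing k-dim space.

    Computes new_row @ V_k @ inv(Sigma_k); the SVD itself is not updated.
    """
    return (new_row @ Vt_k.T) / s_k

# Sanity check: folding in a row that was part of A gives its row of U_k.
projected = fold_in(A[0], Vt_k, s_k)
```

Folding in is cheap, but the existing singular vectors are left unchanged, so the approximation can degrade as more new rows are added this way.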
