Dimensionality reduction of a TF-IDF matrix
I have computed TF-IDF (term frequency, inverse document frequency) for my documents, and I have read that after this step the dimensionality of the matrix should be reduced with a method such as LSI or a chi-square test.
I have no idea how to implement a chi-square test in Java to reduce the dimensionality of a TF-IDF matrix. If there is a library that does this, or a tutorial that explains how, please let me know.
Comments (2)
Use the gensim library (Python) for LSA or LDA.
It can run LSA on corpora too large to fit in memory, because it reads documents lazily instead of loading the entire corpus at once.
I don't think you want to do chi-square; that's not a technique for dimension reduction.
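For context on why chi-square behaves differently from a projection: it scores how strongly a single term is associated with a class label, so it needs labeled documents and can only rank or drop existing terms. The following is a minimal sketch of that score for one term and one class, assuming you already have the four document counts of a 2x2 contingency table. The class name `ChiSquareTermScore`, the method `chiSquare`, and the counts in `main` are hypothetical and not taken from any library.

```java
public class ChiSquareTermScore {

    /**
     * Chi-square statistic for a 2x2 term/class contingency table.
     *
     * @param a documents in the class that contain the term
     * @param b documents outside the class that contain the term
     * @param c documents in the class that do not contain the term
     * @param d documents outside the class that do not contain the term
     */
    public static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double diff = (double) a * d - (double) b * c;
        // Product of the row totals and column totals of the table.
        double denom = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denom == 0.0) {
            return 0.0; // degenerate table: the term or the class never varies
        }
        return n * diff * diff / denom;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        double score = chiSquare(40, 10, 20, 130);
        System.out.println("chi-square score = " + score);
    }
}
```

A higher score means the term and the class are less likely to be independent. Keeping only the highest-scoring terms shrinks the matrix, but every remaining column is still an original term.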
What you want to do is SVD, or singular value decomposition. That is the technique used in LSI/LSA for dimensionality reduction.
Wikipedia suggests a Java library for LSA called the 'S-Space Package'. I haven't used it myself, but you may want to look into it.
http://code.google.com/p/airhead-research/
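To make the SVD suggestion concrete, here is a rough sketch in plain Java, assuming a small dense matrix with documents as rows and terms as columns. It approximates the top k right singular vectors by power iteration with deflation and then projects each document onto them. This is not the S-Space Package API; the class `TruncatedSvdSketch`, its methods, and the toy matrix are hypothetical, and a real corpus would call for a sparse, library-backed SVD instead.

```java
import java.util.Random;

public class TruncatedSvdSketch {

    /** Approximates the top-k right singular vectors of a; row c of the result is vector c. */
    public static double[][] topRightSingularVectors(double[][] a, int k, int iterations) {
        int n = a[0].length;
        double[][] v = new double[k][n];
        Random rng = new Random(42); // fixed seed so runs are repeatable
        for (int c = 0; c < k; c++) {
            double[] x = new double[n];
            for (int j = 0; j < n; j++) x[j] = rng.nextGaussian();
            for (int it = 0; it < iterations; it++) {
                removeComponents(x, v, c); // stay orthogonal to vectors already found
                normalize(x);
                x = multiplyTransposed(a, multiply(a, x)); // x <- A^T (A x)
            }
            removeComponents(x, v, c);
            normalize(x);
            v[c] = x;
        }
        return v;
    }

    /** Projects each row of a onto the given vectors, giving a rows-by-k matrix. */
    public static double[][] project(double[][] a, double[][] v) {
        double[][] reduced = new double[a.length][v.length];
        for (int i = 0; i < a.length; i++)
            for (int c = 0; c < v.length; c++)
                reduced[i][c] = dot(a[i], v[c]);
        return reduced;
    }

    private static double[] multiply(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++) y[i] = dot(a[i], x);
        return y;
    }

    private static double[] multiplyTransposed(double[][] a, double[] y) {
        double[] z = new double[a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < z.length; j++)
                z[j] += a[i][j] * y[i];
        return z;
    }

    private static void removeComponents(double[] x, double[][] v, int count) {
        for (int c = 0; c < count; c++) {
            double d = dot(x, v[c]);
            for (int j = 0; j < x.length; j++) x[j] -= d * v[c][j];
        }
    }

    private static void normalize(double[] x) {
        double norm = Math.sqrt(dot(x, x));
        if (norm == 0.0) return; // nothing left to normalize (rank lower than k)
        for (int j = 0; j < x.length; j++) x[j] /= norm;
    }

    private static double dot(double[] p, double[] q) {
        double s = 0.0;
        for (int j = 0; j < p.length; j++) s += p[j] * q[j];
        return s;
    }

    public static void main(String[] args) {
        // Toy TF-IDF values (4 documents x 5 terms), made up for illustration.
        double[][] tfidf = {
            {0.5, 0.4, 0.0, 0.0, 0.1},
            {0.6, 0.3, 0.0, 0.1, 0.0},
            {0.0, 0.0, 0.7, 0.5, 0.0},
            {0.0, 0.1, 0.6, 0.6, 0.0}
        };
        double[][] v = topRightSingularVectors(tfidf, 2, 100);
        for (double[] row : project(tfidf, v)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Each document ends up with k coordinates instead of one per term. The sign of each coordinate is arbitrary, which is normal for SVD.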