Doubts about LSA

Posted on 2024-12-29 09:28:07

I need to find the similarity between a reference document and a set of documents in a repository.

Method:

1. I build the term-document matrix for all the documents, including the reference document.
2. I compute the SVD of this matrix.
3. I take the V array (the third result).
4. I transpose this matrix so that each row represents a document.
5. The first row represents the reference document.
6. I find the cosine similarity between this row and the rest of the rows.

My doubts:

Since I have around 7 documents in my DB, I only get an 8×8 V array (the document matrix). Will I get a correct result if I compute the cosine similarity using these 8 values alone?
Is this method generally used?

I am coding this in Java, and I use the JAMA package to compute the SVD.
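
For reference, step 6 of the method can be sketched in plain Java as below. This is only a minimal illustration, not the actual code in question: the class `LsaCosineSketch`, its method names, and the sample numbers are all hypothetical, and it assumes the document vectors are already available as a `double[][]` with one row per document and the reference document in row 0 (with JAMA, such an array could presumably be obtained from `Matrix.svd().getV().getArray()`). The parameter `k` is an added assumption that limits the comparison to the first `k` components of each row.

```java
import java.util.Arrays;

// Hypothetical helper class; names and data are illustrative only.
class LsaCosineSketch {

    // Cosine similarity of two vectors, using only their first k components.
    // k must not exceed the length of either vector.
    static double cosine(double[] a, double[] b, int k) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < k; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for all-zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Row 0 is the reference document; the result holds its similarity
    // to rows 1..n-1, in that order.
    static double[] similaritiesToFirstRow(double[][] docVectors, int k) {
        double[] sims = new double[docVectors.length - 1];
        for (int d = 1; d < docVectors.length; d++) {
            sims[d - 1] = cosine(docVectors[0], docVectors[d], k);
        }
        return sims;
    }

    public static void main(String[] args) {
        // Made-up document vectors (one row per document), not real SVD output.
        double[][] docVectors = {
            {0.5, 0.1, 0.3},
            {0.4, 0.2, -0.6},
            {-0.1, 0.7, 0.2}
        };
        // Compare the reference row with the others on the first 2 components.
        System.out.println(Arrays.toString(similaritiesToFirstRow(docVectors, 2)));
    }
}
```

The reason for exposing `k` is that the full V returned by an SVD is an orthogonal matrix, so when all of its columns are used, any two distinct rows have a cosine of 0. LSA is normally run with `k` smaller than the number of documents, which is what makes the row-to-row comparison informative.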
