Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 11 years ago.
Have you had a look at Lucene and Mahout?
This might be useful - Latent Dirichlet Allocation with Lucene and Mahout.
You might be thinking of LSA (Latent Semantic Analysis), which is a very common solution to this kind of problem.
A bit old, but for anyone still interested, take a look at this blog post (disclaimer: this is my own blog). The algorithm described there and the linked code will probably do what you need if you don't have your heart set on any specific approach.
Regarding Shashikant's comment, cosine similarity may not be a good option because the signatures are proportional in length to the documents. Constant-length signatures are preferable.
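To illustrate what a constant-length signature can look like, here is a minimal sketch using MinHash. This is a swapped-in, well-known technique and not necessarily the algorithm from the blog post; the function names (`tokens`, `seeded_hash`, `minhash_signature`, `estimated_jaccard`) and the default of 64 hash functions are assumptions made up for this example.

```python
import hashlib
import re


def tokens(text):
    """Return the set of lowercased word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def seeded_hash(seed, token):
    """64-bit hash of a token; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=64):
    """Signature with exactly num_hashes entries, whatever the document length."""
    words = tokens(text) or {""}  # hypothetical fallback for empty documents
    return [min(seeded_hash(seed, w) for w in words) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over a sleepy dog"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

Both signatures have 64 entries regardless of how long the documents are, so they can be compared position by position. More hash functions give a less noisy estimate at the cost of a longer signature.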
Try this service for computing cosine similarity between two documents:
http://www.scurtu.it/documentSimilarity.html
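If you would rather compute it locally, here is a rough sketch of cosine similarity over raw term counts using only the standard library. It does not reproduce whatever the linked service does internally; the tokenization and the helper names (`term_counts`, `cosine_similarity`) are assumptions for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercased word tokens in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors."""
    va, vb = term_counts(text_a), term_counts(text_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat on the mat", "the cat lay on the rug"))
```

In practice you would probably weight the counts (for example with TF-IDF) and drop stop words before comparing, but the core computation stays the same.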