Finding the similarity between two documents
Is there a built-in algorithm to find the similarity between two documents in Lucene?
When I went through the default Similarity class, it gives a score as the result of comparing a query and a document.
I have already indexed my documents using the Snowball analyzer; the next step would be to find the similarity between two documents.
Can somebody suggest a solution?
Comments (1)
There does not seem to be a built-in algorithm. I believe there are three ways you can go with this:
a) Run a MoreLikeThis query on one of the documents. Iterate through the results, check for the other document's doc id and get its score. Not pretty, and you may need to request a lot of hits for the document you care about to end up among the returned ones (see the sketch right after this list).
b) Cosine Similarity: the answers at the link Mikos provided in his comment explain how Cosine similarity can be computed for two documents.
c) Compute your own Lucene Similarity Score. The Lucene score adds a few factors to Cosine Similarity (http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
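To illustrate option (a), here is a minimal sketch, assuming Lucene 4.x, an open IndexReader, and that the compared field was indexed. The class name MltPairScore and the method score are made up for the example; only the MoreLikeThis, IndexSearcher, and ScoreDoc calls are standard Lucene API.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

public class MltPairScore {

    /**
     * Builds a MoreLikeThis query from docA and returns the score Lucene
     * assigns to docB in the result list, or -1 if docB is not among the hits.
     */
    public static float score(IndexReader reader, Analyzer analyzer,
                              String field, int docA, int docB) throws IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(analyzer);
        mlt.setFieldNames(new String[] { field });
        mlt.setMinTermFreq(1);   // keep even rare terms so short documents still match
        mlt.setMinDocFreq(1);

        Query query = mlt.like(docA);            // "documents like document A"
        IndexSearcher searcher = new IndexSearcher(reader);

        // Request a generous number of hits: docB may rank far down the list.
        for (ScoreDoc hit : searcher.search(query, 1000).scoreDocs) {
            if (hit.doc == docB) {
                return hit.score;
            }
        }
        return -1f; // docB did not make it into the top hits
    }
}
```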
For option (c), you can use the Similarity's own scorer: its score(doc, freq) method (ExactSimScorer in Lucene 4.2, merged into SimScorer in later 4.x releases) reproduces the per-term part of the Lucene score. You can get the parameters it needs, for example, through Similarity.computeWeight(...), where in turn you can get the term statistics using the TermVector of the first of your two documents, and your IndexReader for the collection statistics. To get the freq parameter, obtain a DocsEnum for the term, iterate through the postings until you reach the doc id of the document you are scoring, and read docsEnum.freq(). A sketch of these steps follows below.
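Here is a rough sketch of those steps, under several assumptions: Lucene 4.2 (matching the linked TFIDFSimilarity docs; later 4.x releases renamed exactSimScorer to simScorer), DefaultSimilarity, a single-segment index, and a field indexed with term vectors. It treats the first document's terms as the "query" and reads the freq of each term in the other document; swap the roles if you want the opposite direction. The class PairwiseTfIdfScore and all variable names are invented for the example.

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.util.BytesRef;

public class PairwiseTfIdfScore {

    /**
     * Sums the per-term TF-IDF scores of scoredDoc, treating queryDoc's terms
     * as the "query". Returns {sumOfScores, sumOfValueForNormalization,
     * matchedTerms, queryTerms} so that queryNorm and coord can be applied later.
     */
    public static float[] rawScore(IndexReader reader, String field,
                                   int queryDoc, int scoredDoc) throws IOException {
        Similarity similarity = new DefaultSimilarity();

        // Collection-level statistics come from the IndexReader.
        CollectionStatistics collectionStats = new CollectionStatistics(
                field, reader.maxDoc(), reader.getDocCount(field),
                reader.getSumTotalTermFreq(field), reader.getSumDocFreq(field));

        // Assumes a single-segment index; otherwise pick the leaf containing scoredDoc.
        AtomicReaderContext context = reader.leaves().get(0);

        // The "query" terms are read from queryDoc's term vector.
        Terms termVector = reader.getTermVector(queryDoc, field);
        if (termVector == null) {
            throw new IllegalStateException("Field must be indexed with term vectors");
        }
        TermsEnum tvIterator = termVector.iterator(null);

        float sumScores = 0f;
        float sumValueForNormalization = 0f;
        int queryTerms = 0;
        int matchedTerms = 0;

        BytesRef termBytes;
        while ((termBytes = tvIterator.next()) != null) {
            queryTerms++;
            // Deep-copy: the TermsEnum reuses its BytesRef between calls.
            Term term = new Term(field, BytesRef.deepCopyOf(termBytes));

            // Per-term collection statistics, again from the IndexReader.
            TermStatistics termStats = new TermStatistics(
                    term.bytes(), reader.docFreq(term), reader.totalTermFreq(term));

            Similarity.SimWeight weight = similarity.computeWeight(1f, collectionStats, termStats);
            sumValueForNormalization += weight.getValueForNormalization();
            weight.normalize(1f, 1f); // queryNorm is applied once, at the end

            // Lucene 4.2: exactSimScorer; later 4.x releases: simScorer.
            Similarity.ExactSimScorer scorer = similarity.exactSimScorer(weight, context);

            // freq: walk the postings of this term until we reach scoredDoc.
            DocsEnum docsEnum = MultiFields.getTermDocsEnum(
                    reader, MultiFields.getLiveDocs(reader), field, term.bytes());
            int freq = 0;
            if (docsEnum != null) {
                int doc;
                while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    if (doc == scoredDoc) {
                        freq = docsEnum.freq();
                        break;
                    }
                }
            }

            if (freq > 0) {
                matchedTerms++;
                sumScores += scorer.score(scoredDoc, freq);
            }
        }
        return new float[] { sumScores, sumValueForNormalization, matchedTerms, queryTerms };
    }
}
```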
Note that you need to call "scorer.score" for each term (or each term you want to consider) in your first document, and sum up the results.
In the end, to multiply with the "queryNorm" and "coord" parameters, you can use the Similarity's queryNorm(float) and coord(int, int) methods, as in the sketch below.
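A short sketch of that last step; the inputs are the values accumulated in the previous sketch, and CombineScore is again an invented name:

```java
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class CombineScore {

    /**
     * Applies the queryNorm and coord factors to the summed per-term scores.
     */
    public static float combine(float sumScores, float sumValueForNormalization,
                                int matchedTerms, int queryTerms) {
        DefaultSimilarity similarity = new DefaultSimilarity();

        // 1 / sqrt(sum of the per-term getValueForNormalization() values)
        float queryNorm = similarity.queryNorm(sumValueForNormalization);

        // fraction of the "query" document's terms that also occur in the scored document
        float coord = similarity.coord(matchedTerms, queryTerms);

        return coord * queryNorm * sumScores;
    }
}
```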
So this is a way that should work. It is not elegant, and due to the difficulty of getting term frequencies (iterating over a DocsEnum for each term), it is not very efficient either. I still hope this helps someone :)