Finding the similarity between two documents

Published on 2024-12-26 14:02:17


Is there a built-in algorithm in Lucene to find the similarity between two documents?
When I went through the default Similarity class, it produces a score as the result of comparing a query against a document.

I have already indexed my documents using the Snowball analyzer; the next step would be to find the similarity between two documents.

Can somebody suggest a solution?


残龙傲雪 2025-01-02 14:02:17


There does not seem to be a built-in algorithm. I believe there are three ways you can go with this:

a) Run a MoreLikeThis query on one of the documents. Iterate through the results, check the doc ids, and read off the score. Not pretty: you may need to request a large number of hits so that the document you want to compare against is among the ones returned.
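A rough sketch of option (a) against the Lucene 4.x API (the reader, field name, and doc ids below are assumptions of this sketch, not fixed names):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Sketch only: assumes an open IndexReader over your index and the
// internal doc ids of the two documents you want to compare.
public class MltSimilarity {
    public static float similarity(IndexReader reader, int docId1, int docId2,
                                   String field) throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { field });
        mlt.setMinTermFreq(1);  // the defaults (2 and 5) may drop too many terms
        mlt.setMinDocFreq(1);
        Query query = mlt.like(docId1);            // "more like document 1"
        IndexSearcher searcher = new IndexSearcher(reader);
        // Ask for many hits so document 2 has a chance of being among them.
        TopDocs hits = searcher.search(query, reader.maxDoc());
        for (ScoreDoc sd : hits.scoreDocs) {
            if (sd.doc == docId2) {
                return sd.score;  // score of doc 2 against the "like doc 1" query
            }
        }
        return 0f;                // document 2 did not match at all
    }
}
```

Note the score is only comparable between hits of the same query, which is one reason this approach is "not pretty".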

b) Cosine Similarity: the answers at the link Mikos provided in his comment explain how Cosine similarity can be computed for two documents.
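As a minimal sketch of option (b): cosine similarity can be computed directly from two sparse term-frequency vectors. The maps here are filled by hand for illustration; in practice you would build them from each document's TermVector.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {
    // Cosine similarity of two sparse term-frequency vectors:
    // dot(a, b) / (|a| * |b|)
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean length of a term-frequency vector.
    public static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("lucene", 2);
        doc1.put("similarity", 1);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("lucene", 1);
        doc2.put("score", 1);
        System.out.println(cosine(doc1, doc2));  // ~0.632; identical docs give 1.0
    }
}
```

The result is 1.0 for identical term distributions and 0.0 when the documents share no terms, which makes it easy to interpret as a similarity measure.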

c) Compute your own Lucene Similarity Score. The Lucene score adds a few factors to Cosine Similarity (http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).

You can use

DefaultSimilarity ds = new DefaultSimilarity();
SimScorer scorer = ds.simScorer(stats, arc);  // stats: SimWeight, arc: AtomicReaderContext
scorer.score(otherDocId, freq);               // freq: frequency of the term in that doc

You can get the parameters, for example, through

AtomicReaderContext arc = reader.leaves().get(0);  // reader is your IndexReader
SimWeight stats = ds.computeWeight(1, collectionStats, termStats);  // boost = 1
stats.normalize(1, 1);  // queryNorm = 1, topLevelBoost = 1

where in turn you can get the term stats using the TermVector for the first of your two documents, and your IndexReader for collection stats. To get the freq parameter, use

DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, null, field, term);

then iterate through the docs until you find the doc id of the other document (the one you pass to scorer.score), and do

freq = docsEnum.freq();

Note that you need to call "scorer.score" for each term (or each term you want to consider) in your first document, and sum up the results.
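Put together, the per-term loop might look roughly like this (a sketch against the Lucene 4.x API; thisTV is the first document's term vector, and field, reader, collectionStats, arc, and otherDocId come from the setup above and are assumptions of this sketch):

```java
// Sketch: iterate over the terms of document 1 (its term vector) and sum
// the SimScorer score of each term against document 2 (otherDocId).
float score = 0f;
float sumWeights = 0f;  // collected for queryNorm later
TermsEnum termsEnum = thisTV.iterator(null);
BytesRef text;
while ((text = termsEnum.next()) != null) {
    Term term = new Term(field, BytesRef.deepCopyOf(text));
    TermStatistics termStats = new TermStatistics(
            term.bytes(), reader.docFreq(term), reader.totalTermFreq(term));
    SimWeight stats = ds.computeWeight(1, collectionStats, termStats);
    sumWeights += stats.getValueForNormalization();
    stats.normalize(1, 1);
    SimScorer scorer = ds.simScorer(stats, arc);
    // Frequency of this term in the *other* document:
    DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, null, field, term.bytes());
    if (docsEnum != null && docsEnum.advance(otherDocId) == otherDocId) {
        score += scorer.score(otherDocId, docsEnum.freq());
    }
}
```

Terms of document 1 that do not occur in document 2 simply contribute nothing, which is why the advance check guards the score call.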

In the end, to multiply with the "queryNorm" and "coord" parameters, you can use

//sumWeights was computed while iterating over the first termvector
//in the main loop by summing up "stats.getValueForNormalization();"
float queryNorm = ds.queryNorm(sumWeights);
//thisTV and otherTV are termvectors for the two documents.
//overlap can be easily calculated
float coord = ds.coord(overlap, (int) Math.min(thisTV.size(), otherTV.size()));
return coord * queryNorm * score;

So this is an approach that should work. It is not elegant, and because getting the term frequencies is cumbersome (iterating over a DocsEnum for each term), it is not very efficient either. I still hope some of this helps someone :)
