Cosine similarity between a query and a document in Lucene

Posted on 2024-12-02 03:28:57

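I want to get the cosine similarity between a long query and the documents in a collection. I'm using Lucene to index the collection and submitting the queries to retrieve documents.

However, I'm getting the following error for some of the queries: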

"Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024"

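I replicated some of the terms in the query to boost their weight. But it seems Lucene is just doing simple boolean retrieval instead of calculating the cosine similarity using tf-idf for both document and query.

Can anybody confirm this?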



段念尘 2024-12-09 03:28:57

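This page explains the scoring used in Lucene:

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

It states:

The score of query q for document d correlates to the cosine-distance or dot-product between document and query vectors in a Vector Space Model (VSM) of Information Retrieval. A document whose vector is closer to the query vector in that model is scored higher.

So no, Lucene is not just using boolean retrieval.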
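To make the vector space idea in the quoted passage concrete, here is a minimal sketch of a tf-idf cosine similarity in plain Java. It is not Lucene's implementation: Lucene's own scoring formula, described on the Similarity page linked above, includes further factors. The class name `CosineSketch`, the whitespace tokenization, and the idf values are all assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: tf-idf cosine similarity on toy data, independent of Lucene. */
class CosineSketch {

    /** Counts how often each whitespace-separated term occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two tf-idf vectors; terms missing from idf get weight 1.0. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b, Map<String, Double> idf) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double weight = idf.getOrDefault(e.getKey(), 1.0);
            double wa = e.getValue() * weight;
            normA += wa * wa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += wa * (fb * weight); // only shared terms contribute to the dot product
            }
        }
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            double wb = e.getValue() * idf.getOrDefault(e.getKey(), 1.0);
            normB += wb * wb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new HashMap<>(); // hypothetical idf values
        idf.put("apple", 1.0);
        idf.put("orange", 2.0);
        Map<String, Integer> query = termFrequencies("apple orange orange");
        Map<String, Integer> doc = termFrequencies("orange juice");
        System.out.println(cosine(query, doc, idf));
    }
}
```

Note that repeating a term in the query raises its term frequency and therefore its share of the query vector, which is the effect the question tried to achieve by replicating terms.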

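Your exception is related to your query and the way Lucene transforms it. It would be helpful if you could give an example of a query that's failing.

Furthermore, you write:

I replicated some of the terms in the query to boost their weight.

You don't have to do that; instead, you can simply assign a weight to the terms in your query:
http://lucene.apache.org/java/2_0_0/queryparsersyntax.html

E.g. to search for apple and orange, and boost orange, you can write: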

apple orange^4
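If the long query has already been built by repeating terms, the repeats can be collapsed into boosts programmatically. The following is a rough sketch in plain Java, assuming the query consists of whitespace-separated plain words with no query syntax characters that would need escaping. The class and method names (`BoostSketch`, `collapseRepeats`) are hypothetical, and a boost of n is not guaranteed to score identically to n repetitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: rewrite repeated query terms as single boosted terms. */
class BoostSketch {

    /** Collapses repeated terms into term^count, keeping first-seen order. */
    static String collapseRepeats(String rawQuery) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : rawQuery.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey());
            if (e.getValue() > 1) {
                sb.append('^').append(e.getValue()); // boost syntax from the query parser docs
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "apple orange^4"
        System.out.println(collapseRepeats("apple orange orange orange orange"));
    }
}
```

Because each distinct term then appears only once, the parsed query should contain fewer clauses, which may also help with the TooManyClauses error. If the query still exceeds the limit, the Lucene releases linked above also expose a static `BooleanQuery.setMaxClauseCount(int)`; check the Javadoc of the version in use before relying on it.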
