Cosine similarity between a query and a document in Lucene
I want to compute the cosine similarity between a long query and the documents in a collection. I'm using Lucene to index the collection and submit queries to retrieve documents.
However, I'm getting the following error for some of the queries.
"Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024"
I replicated some of the terms in the query to boost their weight. But it seems Lucene is just doing simple boolean retrieval instead of calculating the cosine similarity using tf-idf for both document and query.
Can anybody confirm this?
Comments (1)
This page explains the scoring used in Lucene:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
It states:
So no, Lucene is not just using boolean retrieval.
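To make concrete what "cosine similarity using tf-idf" means, here is a minimal plain-Java sketch of the textbook computation. It is not Lucene's actual scoring code: the class `TfIdfCosine`, its method names, and the smoothed idf formula are assumptions chosen for illustration, and Lucene's `Similarity` applies its own tf, idf and normalization factors.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: textbook tf-idf cosine similarity, not Lucene's scorer.
final class TfIdfCosine {

    // Count raw term frequencies for one tokenized text.
    static Map<String, Integer> termFreq(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Smoothed idf (an assumption for this sketch): 1 + ln(N / (df + 1)).
    static double idf(String term, List<Map<String, Integer>> docs) {
        int df = 0;
        for (Map<String, Integer> d : docs) {
            if (d.containsKey(term)) {
                df++;
            }
        }
        return 1.0 + Math.log((double) docs.size() / (df + 1));
    }

    // Cosine of the angle between the tf-idf vectors of query and doc.
    static double cosine(Map<String, Integer> query,
                         Map<String, Integer> doc,
                         List<Map<String, Integer>> docs) {
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (Map.Entry<String, Integer> e : query.entrySet()) {
            double termIdf = idf(e.getKey(), docs);
            double qw = e.getValue() * termIdf;
            qNorm += qw * qw;
            Integer dtf = doc.get(e.getKey());
            if (dtf != null) {
                dot += qw * dtf * termIdf;
            }
        }
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            double dw = e.getValue() * idf(e.getKey(), docs);
            dNorm += dw * dw;
        }
        if (qNorm == 0.0 || dNorm == 0.0) {
            return 0.0; // an empty vector has no direction
        }
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }
}
```

In this model, repeating a term in the query raises that term's weight in the query vector, so the query moves closer to documents that contain the term.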
Your exception is related to your query and the way Lucene transforms it. It would be helpful if you could give an example of a query that's failing.
Furthermore, you write: "I replicated some of the terms in the query to boost their weight."
You don't have to do that; instead, you can simply assign a weight to the terms in your query:
http://lucene.apache.org/java/2_0_0/queryparsersyntax.html
E.g. to search for apple and orange, and boost orange, you can write:
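In the query parser syntax linked above, a boost is written as `term^factor`, so one way to express this is `apple orange^4` (the factor 4 is an arbitrary choice for illustration). As a rough sketch of how a query with replicated terms could be rewritten into that form, the hypothetical helper below collapses duplicates into a single boosted term. The class `BoostedQuery` is not part of Lucene, using the repeat count as the boost is an assumption, and the terms are assumed to contain no query-parser special characters.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not a Lucene API: turns repeated terms into term^count.
final class BoostedQuery {

    // Builds a query string in which each distinct term appears once,
    // boosted by the number of times it occurred in the input.
    static String build(List<String> terms) {
        // LinkedHashMap keeps the first-seen order of the terms.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : terms) {
            counts.merge(t, 1, Integer::sum);
        }
        StringJoiner out = new StringJoiner(" ");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int n = e.getValue();
            // Terms that occur once need no explicit boost.
            out.add(n > 1 ? e.getKey() + "^" + n : e.getKey());
        }
        return out.toString();
    }
}
```

Because each distinct term then appears only once, the parsed query should also contain fewer clauses, which may help with the `maxClauseCount` limit mentioned in the question. The resulting scores are not guaranteed to match those of the replicated query exactly.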