How do I normalize Lucene scores?
I need to normalize Lucene scores to the range 0 to 1.
For example, a random query returns the following scores:
8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242
0.33730242
0.33730242
0.33730242
What is the maximum possible score? 10.0?
Thanks.
Comments (6)
You can divide all scores by the maximum score to get scores between 0 and 1.
However, note that the normalized scores should only be used to compare the results of a single query. It is not correct to compare the scores (normalized or not) of results from two different queries.
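A minimal sketch of the divide-by-maximum approach, using the scores from the question. The function name `normalize_by_max` is hypothetical, and the sketch works on a plain list of floats rather than on Lucene's own result objects.

```python
def normalize_by_max(scores):
    """Divide each score by the largest score in the same result list."""
    if not scores:
        return []
    top = max(scores)
    if top <= 0:
        # Nothing meaningful to scale by; treat every hit as 0.
        return [0.0 for _ in scores]
    return [s / top for s in scores]


# Scores taken from the question above.
scores = [8.864665, 2.792687, 2.792687, 2.792687, 2.792687,
          0.49009037, 0.33730242, 0.33730242, 0.33730242, 0.33730242]
normalized = normalize_by_max(scores)

for raw, norm in zip(scores, normalized):
    print(f"{raw:<12} {norm:.4f}")
```

The top hit always maps to 1.0, so the normalized values only describe how the other hits of the same query rank relative to it.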
There is no good standard way to normalize scores with Lucene. Read this: ScoresAsPercentages, and this explanation.
In your case, if the results are sorted by score, the highest score is the score of the first result. That score will be different for every other query, though.
See also how-do-i-normalise-a-solr-lucene-score
There is no maximum score in Solr; the score depends on too many variables to be predicted.
You can implement a so-called normalized score (Scores As Percentages), but this is not recommended.
See related links for more details:
Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
how do I normalise a solr/lucene score?
Remove results below a certain score threshold in Solr/Lucene?
Regular normalization only helps you compare the score distributions of queries (and of their retrieved lists).
You cannot simply normalize the scores to compare performance across queries.
Consider one query for which all retrieved documents are highly relevant and receive the same (high) score, and another query whose retrieved list consists of barely relevant documents (again, all with the same score). Whatever per-query normalization you apply, the normalized scores will be the same for both queries.
You need a cross-query factor that brings all the scores to the same scale.
For example, you could compute the similarity between the query and the whole index, and combine that score in some way with the document score.
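The two-query scenario described in this answer can be illustrated with a small sketch. The raw scores below are invented for illustration (they do not come from a real index), and `normalize_by_max` is a hypothetical helper.

```python
def normalize_by_max(scores):
    """Per-query normalization: divide by the query's own top score."""
    top = max(scores)
    return [s / top for s in scores]


# Hypothetical raw scores: every hit of the first query is highly relevant,
# every hit of the second query is barely relevant.
highly_relevant = [9.2, 9.2, 9.2]
barely_relevant = [0.3, 0.3, 0.3]

norm_high = normalize_by_max(highly_relevant)
norm_low = normalize_by_max(barely_relevant)

# After per-query normalization the two lists are indistinguishable.
same = norm_high == norm_low
print(norm_high, norm_low, same)
```

Both lists collapse to all ones, so the information that one query matched far better than the other is lost once each query is scaled by its own maximum.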
If you want to compare two or more queries, I found a workaround.
You can compare your highest-scoring document with your query term using the LevenshteinDistance or LuceneLevenshteinDistance (Damerau) class to get the distance between the query term and the result. The result is the similarity between them. Do this for each query you want to compare. You then have a way to compare your queries, based on the similarity between each query term and its top result. You can choose the query with the highest similarity and use it for the next steps.
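A rough sketch of this workaround in plain Python, under several assumptions: it does not call Lucene's classes, it uses classic Levenshtein distance (not the Damerau variant), it converts distance to a similarity as 1 - distance / length of the longer string, and the `top_hits` mapping from query term to the text of its top result is invented for illustration.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_query(top_hits):
    """Pick the query term that is most similar to its own top result."""
    return max(top_hits, key=lambda q: similarity(q, top_hits[q]))


# Hypothetical data: query term -> text of its highest-scoring result.
top_hits = {"lucine": "lucene", "lucene": "lucene", "solar": "solr"}
print(best_query(top_hits))
```

This compares strings only, so it is most plausible for short fields such as titles or names; for long documents the edit distance to a short query term is dominated by the length difference.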
I applied a non-linear function to compress the scores of each query.
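The answer does not say which function was used. As one possible illustration, the sketch below assumes the saturating form s / (s + k); both the function `squash` and the constant `k` are hypothetical choices, not the answerer's stated method.

```python
def squash(score, k=1.0):
    """Map a non-negative score into [0, 1) using s / (s + k).

    k (assumed, must be positive) is the raw score that maps to 0.5.
    """
    if score < 0:
        raise ValueError("score must be non-negative")
    return score / (score + k)


# A few of the scores from the question above.
scores = [8.864665, 2.792687, 0.49009037, 0.33730242]
compressed = [squash(s) for s in scores]

for raw, value in zip(scores, compressed):
    print(f"{raw:<12} {value:.4f}")
```

Unlike dividing by the maximum, this mapping does not depend on the other hits in the list, and it preserves the ranking because the function is monotonic. It does not by itself make scores from different queries comparable.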