Is it possible to set a Solr score threshold "reasonably", independent of the results returned? (i.e. are Solr scores normalized in any way?)
I have a Solr index with many entries, and a query returns some subset of them, each with a score. Once the results come back with their scores, I want to keep only the results above some score (i.e. only results of a certain quality). Is this possible when the returned subset could be anything?
I ask because on some queries a score of, say, 0.008 gives a decent match, whereas on other queries a higher score gives a poor match.
Ideally, I'm just looking for a way to take the top x entries, as long as they are of at least a certain quality.
Comments (2)
I think you should not do this. With the TF-IDF scoring model, there is no way to compute a score above which all results are relevant and below which none are. And even if you manage to find one, the threshold will very likely stop being valid after a few updates to your index, because document frequencies will change.
If you still want to do this, I think it is achievable using function queries: Solr has an if function (in trunk) and a query function. Just filter your results so that you only keep entries whose score is higher than a given threshold.
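As a rough illustration of the filtering idea above, here is a minimal sketch that builds the request parameters for such a query. It relies on Solr's frange query parser wrapped around the query function, rather than the if function mentioned above. The helper name `build_threshold_params`, the example query and the threshold value are all made up for illustration, and the threshold itself remains subject to the caveats already described.

```python
from urllib.parse import urlencode


def build_threshold_params(user_query, min_score, rows=10):
    """Build Solr request parameters that keep only documents whose
    score for the main query is at least min_score.

    Hypothetical helper for illustration; it only builds parameters
    and does not contact a Solr server.
    """
    return {
        "q": user_query,
        "fl": "*,score",  # ask Solr to return the score with each document
        "rows": rows,
        # frange filters on a function value; query($q) evaluates to the
        # score of the main query for each document. l= is the lower bound.
        "fq": "{!frange l=%s}query($q)" % min_score,
    }


params = build_threshold_params("title:solr", 0.5)
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to the select handler URL of your own core; the filter runs server-side, so documents below the threshold are never returned.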
You may also want to read ScoresAsPercentages first.
Solr does not normalize scores, since that can easily be done on the client side.
You can use the maxScore value provided in the results and divide every score by maxScore.
The first record will then have a score of 1, followed by the rest with lower values.
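A minimal client-side sketch of this division, combined with the "top x of at least a certain quality" idea from the question, might look like the following. It assumes a parsed JSON response with `response.maxScore` and a `score` field on each document (that is, the request asked for the score in `fl`), and that documents arrive sorted by descending score. The function name `keep_top_relative` and the sample data are invented for illustration. Note that the ratio only expresses each score relative to the best hit of the same query; it does not make scores comparable across queries.

```python
def keep_top_relative(response, top_x, min_ratio):
    """Return (id, score / maxScore) for at most top_x documents whose
    normalized score is at least min_ratio.

    Hypothetical helper; assumes docs are sorted by descending score.
    """
    body = response["response"]
    max_score = body.get("maxScore") or 0.0
    if max_score <= 0:
        return []  # nothing matched, or scores were not requested
    kept = []
    for doc in body["docs"][:top_x]:
        ratio = doc["score"] / max_score
        if ratio >= min_ratio:
            kept.append((doc["id"], ratio))
    return kept


# Made-up response shaped like Solr's JSON output.
sample = {
    "response": {
        "maxScore": 2.0,
        "docs": [
            {"id": "a", "score": 2.0},
            {"id": "b", "score": 1.0},
            {"id": "c", "score": 0.2},
        ],
    }
}
result = keep_top_relative(sample, top_x=3, min_ratio=0.5)
print(result)
```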