Scoring of multivalued fields in Solr
If I have a document with a multivalued field in Solr, are the multiple values scored independently, or just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:
I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases) but they all are the same person/document.
Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke
Person 2: David Letterman
Person 3: David Hasselhoff, David Michael Hasselhoff
If I were to search for "David" I'd like for all of these to have about the same chance of a match. If each name is scored independently that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?
Comments (2)
You can just run your query q=field_name:David with debugQuery=on and see what happens.
These are the results (including the score, via fl=*,score), sorted by score desc:
And this is the explanation:
The scoring factors here are:
In your example the fieldNorm makes the difference. You have one document with a lower termFreq (1 instead of 1.4142135), since the term appears just once, but that match is more important because of the field length.
The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single-valued field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)
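As a rough illustration of where those numbers come from, here is a minimal sketch assuming the classic Lucene TF-IDF similarity, where tf is the square root of the term frequency and the length norm is 1/sqrt(number of tokens). It also assumes whitespace tokenization, as with text_ws, and no index-time boost. The function name tf_and_norm is made up for this example. Real fieldNorm values in the explain output are coarser, because Lucene stores the norm in a single byte, and idf and queryNorm are left out because they are the same for every document in a single-term query.

```python
import math

# The example names from the question. All values of a multivalued field end up
# in one token stream, so the field length is the total token count.
people = {
    "Person 1": ["David Bowie", "David Robert Jones", "Ziggy Stardust", "Thin White Duke"],
    "Person 2": ["David Letterman"],
    "Person 3": ["David Hasselhoff", "David Michael Hasselhoff"],
}

def tf_and_norm(values, term):
    """Return (tf, fieldNorm) under the assumed classic formulas."""
    tokens = [tok for value in values for tok in value.split()]
    freq = tokens.count(term)
    tf = math.sqrt(freq)                        # assumed: tf(freq) = sqrt(freq)
    field_norm = 1.0 / math.sqrt(len(tokens))   # assumed: lengthNorm = 1/sqrt(numTerms)
    return tf, field_norm

for name, values in people.items():
    tf, norm = tf_and_norm(values, "David")
    print(f"{name}: tf={tf:.4f} fieldNorm={norm:.4f} tf*fieldNorm={tf * norm:.4f}")
```

Under these assumptions the two occurrences in Person 1 give a tf of sqrt(2), which is the 1.4142135 mentioned above, but the ten-token field pulls the product below that of the two-token David Letterman document.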
UPDATE
I actually think David Bowie deserves his opportunity. As explained above, the fieldNorm makes the difference. Add the attribute omitNorms=true to your text_ws field in the schema.xml and reindex. The same query will give you the following result:
As you can see, now the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two David occurrences are on top with the same score, despite their different lengths, and the shorter document with just one match is last, with the lowest score. Here's the explanation with debugQuery=on:
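To illustrate the effect of omitNorms=true described in this update, here is a rough sketch (not Solr's explain output). It assumes the classic tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms) formulas and treats the norm as 1.0 when norms are omitted. The token counts come from the names in the question, and relative_score and ranking are hypothetical helper names.

```python
import math

# (occurrences of "David", total tokens in the field) for each person in the question
docs = {"Person 1": (2, 10), "Person 2": (1, 2), "Person 3": (2, 5)}

def relative_score(freq, num_tokens, omit_norms):
    """Score up to the factors shared by all documents (idf, queryNorm)."""
    norm = 1.0 if omit_norms else 1.0 / math.sqrt(num_tokens)
    return math.sqrt(freq) * norm

def ranking(omit_norms):
    """Document names ordered by descending relative score."""
    return sorted(docs, key=lambda name: relative_score(*docs[name], omit_norms), reverse=True)

print("with norms:   ", ranking(omit_norms=False))
print("norms omitted:", ranking(omit_norms=True))
```

With norms the short David Letterman document comes first. With norms omitted, the two documents containing David twice tie at the top and the single-match document drops to last place, which matches the behaviour described above.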
You could use Lucene's SweetSpotSimilarity to define a plateau of lengths that should all have a norm of 1.0. This could help in your situation, as long as you are searching for things like names, where lengthNorm doesn't do any good.
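A minimal sketch of the plateau idea, assuming the length norm formula used by Lucene's SweetSpotSimilarity as I understand it: 1/sqrt(steepness * (|n - min| + |n - max| - (max - min)) + 1). The Python function and its parameter names are only illustrative and are not part of any Solr or Lucene API. The plateau bounds of 1 and 10 tokens are an assumption chosen to cover the example names.

```python
import math

def sweet_spot_length_norm(num_terms, ln_min, ln_max, steepness=0.5):
    """Length norm that equals 1.0 for ln_min <= num_terms <= ln_max and decays outside."""
    distance = abs(num_terms - ln_min) + abs(num_terms - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * distance + 1.0)

# Token counts of the three example documents, plus one field outside the plateau.
for length in (2, 5, 10, 20):
    print(length, sweet_spot_length_norm(length, ln_min=1, ln_max=10))
```

With such a plateau all three people get the same norm, so only the term frequency separates them, while unusually long fields are still penalised.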