solr多值字段的评分

发布于 2025-01-05 01:20:11 字数 417 浏览 1 评论 0原文

如果我在 Solr 中有一个包含多值字段的文档,那么多个值是独立评分还是只是串联并作为一个大字段评分?我希望他们能够独立评分。这是我的意思的一个例子:

我有一个文档,其中包含一个人名字段,其中同一个人可能有多个姓名。这些名字都不同(在某些情况下非常不同),但它们都是同一个人/文档。

第 1 个人: 大卫·鲍伊、大卫·罗伯特·琼斯、Ziggy Stardust、Thin White Duke

第二个人: 大卫莱特曼

第三人: David Hasselhoff, David Michael Hasselhoff

如果我要搜索“David”,我希望所有这些都有大约相同的匹配机会。如果每个名字都是独立评分的,情况似乎就是这样。如果它们只是作为单个字段存储和搜索,大卫·鲍伊将因拥有比其他人更多的令牌而受到惩罚。 Solr 如何处理这种情况?

If I have a document with a multivalued field in Solr are the multiple values scored independently or just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:

I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases) but they all are the same person/document.

Person 1:
David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke

Person 2:
David Letterman

Person 3:
David Hasselhoff, David Michael Hasselhoff

If I were to search for "David" I'd like for all of these to have about the same chance of a match. If each name is scored independently that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

生来就爱笑 2025-01-12 01:20:11

您只需使用 debugQuery=on 运行查询 q=field_name:David 并查看会发生什么。

这些是按 score desc 排序的结果(包括通过 fl=*,score 得出的分数):

<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>

解释如下:

<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>

这里的评分因素是:

  • termFreq :某个术语在文档中出现的频率
  • idf:该术语在索引中出现的频率
  • fieldNorm:该术语的重要性,取决于索引时间提升和字段长度

在您的示例中fieldNorm 有所不同。您有一个 termFreq 较低的文档(1 而不是 1.4142135),因为该术语仅出现一次,但由于字段长度,该匹配更为重要。

您的字段是多值的这一事实不会改变评分。我想对于具有相同内容的单个值字段来说也是一样的。 Solr 根据字段长度和术语进行工作,因此,是的,David Bowie 因拥有比其他人更多的标记而受到惩罚。 :)

更新
事实上,我认为大卫·鲍伊值得得到这个机会。正如上面所解释的,fieldNorm 产生了差异。将属性 omitNorms=true 添加到 schema.xml 中的 text_ws 字段并重新索引。相同的查询将给出以下结果:

<doc>
    <float name="score">1.0073696</float>
    <str name="id">1</str>
    <arr name="text">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
<doc>
    <float name="score">1.0073696</float>
    <str name="id">3</str>
    <arr name="text">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.71231794</float>
    <str name="id">2</str>
    <arr name="text">
        <str>David Letterman</str>
    </arr>
</doc>

如您所见,现在 termFreq 获胜,而 fieldNorm 根本不被考虑。这就是为什么出现两次 David 的两个文档尽管长度不同,但仍位于顶部且得分相同,而仅包含一次匹配的较短文档是得分最低的最后一个文档。以下是 debugQuery=on 的解释:

<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>

You can just run your query q=field_name:David with debugQuery=on and see what happens.

These are the results (included the score through fl=*,score) sorted by score desc:

<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>

And this is the explanation:

<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>

The scoring factors here are:

  • termFreq: how often a term appears in the document
  • idf: how often the term appears across the index
  • fieldNorm: importance of the term, depending on index-time boosting and field length

In your example the fieldNorm makes the difference. You have one document with lower termFreq (1 instead of 1.4142135) since the term appears just one time, but that match is more important because of the field length.

The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single value field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)

UPDATE
I actually think David Bowie deserves his opportunity. Like explained above, the fieldNorm makes the difference. Add the attribute omitNorms=true to your text_ws field in the schema.xml and reindex. The same query will give you the following result:

<doc>
    <float name="score">1.0073696</float>
    <str name="id">1</str>
    <arr name="text">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
<doc>
    <float name="score">1.0073696</float>
    <str name="id">3</str>
    <arr name="text">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.71231794</float>
    <str name="id">2</str>
    <arr name="text">
        <str>David Letterman</str>
    </arr>
</doc>

As you can see now the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two David occurences are on top and with the same score, despite of their different lengths, and the shorter document with just one match is the last one with the lowest score. Here's the explanation with debugQuery=on:

<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>
╰◇生如夏花灿烂 2025-01-12 01:20:11

您可以使用 Lucenes SweetSpotSimilarity 来定义长度的平台,其范数均应为 1.0。只要您正在搜索名称等内容,这可以帮助您解决您的情况。 lengthNorm 没有任何好处。

you could use Lucenes SweetSpotSimilarity to define the plateau of lengths that should all have a norm of 1.0. this could help you with your situation as long as you are searching for stuff like names etc. lengthNorm doesn't do any good.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文