Elasticsearch gives different TF-IDF scores to documents in the same index

Posted on 2025-01-31 09:23:00

I'm currently trying to set up a combined query made of several fuzzy match queries. I noticed something I'd like explained before I move on to combining queries.
I have documents like the following indexed in a single index named text:

{
    "article": "someArticleName",
    "articleInfo": "someInfo", // potentially missing if this matters
    "userId": 2
}

If I run the following query:

{
    "from":0,
    "min_score":0.6,
    "query":{
        "bool":{
            "filter":[
                {"term":{"userId":{"value": 2}}}
            ],
            "should": {"match":{"article":{"fuzziness":"AUTO","query":"1705aa"}}}
        }
    },
    "size":20,
    "sort":[{"_score":{"order":"desc"}}],
    "explain": True
}

then I receive this as result:

{'took': 16,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 6, 'relation': 'eq'},
  'max_score': 3.5664783,
  'hits': [{'_index': 'text',
    '_type': '_doc',
    '_id': 'id-1',
    '_score': 3.5664783,
    '_source': {'id': 'id-1',
     'article': '1705aa',
     'indexName': 'text',
     'currentVersion': 0,
     'userId': 2,
     'indexedUtc': '2022-05-23T07:47:48.6175402+00:00'}},
   {'_index': 'text',
    '_type': '_doc',
    '_id': 'id-2',
    '_score': 1.3915253,
    '_source': {'id': 'id-2',
     'article': '1705aa',
     'articleInfo': 'someInfo',
     'userId': 2,
     'indexedUtc': '2022-05-23T09:57:11.8080429+00:00'}},
   ...
}

and this as an explanation:

{'description': 'sum of:',
 'details': [{'description': 'weight(article:1705aa in 220) '
                             '[PerFieldSimilarity], result of:',
              'details': [{'description': 'score(freq=1.0), computed as boost '
                                          '* idf * tf from:',
                           'details': [{'description': 'boost',
                                        'details': [],
                                        'value': 2.2},
                                       {'description': 'idf, computed as log(1 '
                                                       '+ (N - n + 0.5) / (n + '
                                                       '0.5)) from:',
                                        'details': [{'description': 'n, number '
                                                                    'of '
                                                                    'documents '
                                                                    'containing '
                                                                    'term',
                                                     'details': [],
                                                     'value': 1},
                                                    {'description': 'N, total '
                                                                    'number of '
                                                                    'documents '
                                                                    'with '
                                                                    'field',
                                                     'details': [],
                                                     'value': 37}],
                                        'value': 3.232121},
                                       {'description': 'tf, computed as freq / '
                                                       '(freq + k1 * (1 - b + '
                                                       'b * dl / avgdl)) from:',
                                        'details': [{'description': 'freq, '
                                                                    'occurrences '
                                                                    'of term '
                                                                    'within '
                                                                    'document',
                                                     'details': [],
                                                     'value': 1.0},
                                                    {'description': 'k1, term '
                                                                    'saturation '
                                                                    'parameter',
                                                     'details': [],
                                                     'value': 1.2},
                                                    {'description': 'b, length '
                                                                    'normalization '
                                                                    'parameter',
                                                     'details': [],
                                                     'value': 0.75},
                                                    {'description': 'dl, '
                                                                    'length of '
                                                                    'field',
                                                     'details': [],
                                                     'value': 1.0},
                                                    {'description': 'avgdl, '
                                                                    'average '
                                                                    'length of '
                                                                    'field',
                                                     'details': [],
                                                     'value': 1.2972972}],
                                        'value': 0.50156736}],
                           'value': 3.5664783}],
              'value': 3.5664783},
             {'description': 'match on required clause, product of:',
              'details': [{'description': '# clause',
                           'details': [],
                           'value': 0.0},
                          {'description': 'userId:[2 TO 2]',
                           'details': [],
                           'value': 1.0}],
              'value': 0.0}],
 'value': 3.5664783}
{'currentVersion': 0,
 'id': 'id-1',
 'indexName': 'text',
 'indexedUtc': '2022-05-23T07:47:48.6175402+00:00',
 'article': '1705aa',
 'userId': 2}
{'description': 'sum of:',
 'details': [{'description': 'weight(article:1705aa in 0) '
                             '[PerFieldSimilarity], result of:',
              'details': [{'description': 'score(freq=1.0), computed as boost '
                                          '* idf * tf from:',
                           'details': [{'description': 'boost',
                                        'details': [],
                                        'value': 2.2},
                                       {'description': 'idf, computed as log(1 '
                                                       '+ (N - n + 0.5) / (n + '
                                                       '0.5)) from:',
                                        'details': [{'description': 'n, number '
                                                                    'of '
                                                                    'documents '
                                                                    'containing '
                                                                    'term',
                                                     'details': [],
                                                     'value': 5},
                                                    {'description': 'N, total '
                                                                    'number of '
                                                                    'documents '
                                                                    'with '
                                                                    'field',
                                                     'details': [],
                                                     'value': 20}],
                                        'value': 1.3397744},
                                       {'description': 'tf, computed as freq / '
                                                       '(freq + k1 * (1 - b + '
                                                       'b * dl / avgdl)) from:',
                                        'details': [{'description': 'freq, '
                                                                    'occurrences '
                                                                    'of term '
                                                                    'within '
                                                                    'document',
                                                     'details': [],
                                                     'value': 1.0},
                                                    {'description': 'k1, term '
                                                                    'saturation '
                                                                    'parameter',
                                                     'details': [],
                                                     'value': 1.2},
                                                    {'description': 'b, length '
                                                                    'normalization '
                                                                    'parameter',
                                                     'details': [],
                                                     'value': 0.75},
                                                    {'description': 'dl, '
                                                                    'length of '
                                                                    'field',
                                                     'details': [],
                                                     'value': 1.0},
                                                    {'description': 'avgdl, '
                                                                    'average '
                                                                    'length of '
                                                                    'field',
                                                     'details': [],
                                                     'value': 1.1}],
                                        'value': 0.472103}],
                           'value': 1.3915253}],
              'value': 1.3915253},
             {'description': 'match on required clause, product of:',
              'details': [{'description': '# clause',
                           'details': [],
                           'value': 0.0},
                          {'description': 'userId:[2 TO 2]',
                           'details': [],
                           'value': 1.0}],
              'value': 0.0}],
 'value': 1.3915253}
{'currentVersion': 0,
 'id': 'id-2',
 'indexName': 'text',
 'indexedUtc': '2022-05-23T09:57:11.8080429+00:00',
 'article': ' 1705aa ',
 'articleInfo': 'someInfo',
 'userId': 2}

Why do I get a score of 3.5664783 for the first document and 1.3915253 for the second?
They're in the same index and both are an exact match for the fuzzy query. The document counts used in the explanation (n and N) differ between the two hits, and I don't understand why, or how to get equal scores for both documents.

陪我终i 2025-02-07 09:23:00

The response shows that you have 5 shards.
Shards affect relevance scoring. Your documents are distributed among your shards and, by default, Elasticsearch makes each shard responsible for producing its own scores from its own local statistics.
Hence the document counts in the explanation (n and N) differ between the two results: each hit was scored on a different shard.
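
To make the effect concrete, here is a minimal Python sketch that recomputes both scores from the numbers in the explain output, using the formulas printed there (boost * idf * tf). The function name bm25_term_score is hypothetical and only for illustration; the input values are copied from the question.

```python
import math

def bm25_term_score(boost, n, N, freq, dl, avgdl, k1=1.2, b=0.75):
    """Score of one term, following the formulas shown in the explain output."""
    # idf = log(1 + (N - n + 0.5) / (n + 0.5))
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    # tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

# Statistics of the shard that scored id-1 (n=1, N=37, avgdl=1.2972972)
score_id1 = bm25_term_score(boost=2.2, n=1, N=37, freq=1.0, dl=1.0, avgdl=1.2972972)

# Statistics of the shard that scored id-2 (n=5, N=20, avgdl=1.1)
score_id2 = bm25_term_score(boost=2.2, n=5, N=20, freq=1.0, dl=1.0, avgdl=1.1)

print(round(score_id1, 4), round(score_id2, 4))
```

The two documents have the same freq and dl, so the whole gap comes from the shard-local n, N and avgdl, mostly through the idf factor (3.232121 versus 1.3397744 in the explain output).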

Read more about impact of shards on scoring here and here

By default, Elasticsearch uses a search type called Query Then Fetch. It works as follows:

1. Send the query to each shard.
2. Find all matching documents and calculate scores using local Term/Document Frequencies.
3. Build a priority queue of results (sort, pagination with from/size, etc).
4. Return metadata about the results to the requesting node. Note that the actual document is not sent yet, just the scores.
5. Scores from all the shards are merged and sorted on the requesting node, and docs are selected according to the query criteria.
6. Finally, the actual docs are retrieved from the individual shards where they reside.
7. Results are returned to the client.

To get more consistent scores, you can use DFS Query Then Fetch by setting search_type on your search request, like this:

GET /test_index/_search?search_type=dfs_query_then_fetch

It does the following:

1. Prequery each shard, asking about Term and Document frequencies.
2. Send the query to each shard.
3. Find all matching documents and calculate scores using the global Term/Document Frequencies calculated from the prequery.
4. Build a priority queue of results (sort, pagination with from/size, etc).
5. Return metadata about the results to the requesting node. Note that the actual document is not sent yet, just the scores.
6. Scores from all the shards are merged and sorted on the requesting node, and docs are selected according to the query criteria.
7. Finally, the actual docs are retrieved from the individual shards where they reside.
8. Results are returned to the client.
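
The difference between the two search types can be sketched as a small Python simulation. This is not Elasticsearch code: ShardStats, bm25 and merge_stats are hypothetical names, and only the two shards that appear in the explain output are modelled (the real prequery would aggregate all five). The total field lengths 48 and 22 are assumptions derived from N * avgdl (37 * 1.2972972 and 20 * 1.1).

```python
import math
from dataclasses import dataclass

@dataclass
class ShardStats:
    doc_freq: int              # n: documents on the shard containing the term
    doc_count: int             # N: documents on the shard that have the field
    total_field_length: float  # sum of dl over the shard (avgdl = total / N)

def bm25(stats, boost=2.2, freq=1.0, dl=1.0, k1=1.2, b=0.75):
    """Score one term with the formulas from the explain output."""
    avgdl = stats.total_field_length / stats.doc_count
    idf = math.log(1 + (stats.doc_count - stats.doc_freq + 0.5) / (stats.doc_freq + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

def merge_stats(shards):
    """Stand-in for the DFS prequery: sum the statistics over the shards."""
    return ShardStats(
        doc_freq=sum(s.doc_freq for s in shards),
        doc_count=sum(s.doc_count for s in shards),
        total_field_length=sum(s.total_field_length for s in shards),
    )

shard_of_id1 = ShardStats(doc_freq=1, doc_count=37, total_field_length=48.0)
shard_of_id2 = ShardStats(doc_freq=5, doc_count=20, total_field_length=22.0)

# Query Then Fetch: each shard scores with its own local statistics.
local_id1 = bm25(shard_of_id1)
local_id2 = bm25(shard_of_id2)

# DFS Query Then Fetch: every shard scores with the same merged statistics.
merged = merge_stats([shard_of_id1, shard_of_id2])
dfs_id1 = bm25(merged)
dfs_id2 = bm25(merged)

print("local:", local_id1, local_id2)
print("dfs:  ", dfs_id1, dfs_id2)
```

With local statistics the two scores differ as in the question; with merged statistics both documents are scored with identical inputs, so they receive the same score. The price is the extra prequery round trip to every shard.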
