Elasticsearch gives different TF-IDF scores for documents in the same index
I'm currently trying to set up a combined query made of several fuzzy match queries. I noticed something I'd like to have explained before I move on to combining queries.
I have documents such as the following indexed in a single index named text:
{
"article": "someArticleName",
"articleInfo": "someInfo", // potentially missing if this matters
"userId": 2
}
If I run the following query:
{
"from":0,
"min_score":0.6,
"query":{
"bool":{
"filter":[
{"term":{"userId":{"value": 2}}}
],
"should": {"match":{"article":{"fuzziness":"AUTO","query":"1705aa"}}}
}
},
"size":20,
"sort":[{"_score":{"order":"desc"}}],
"explain": True
}
then I receive this result:
{'took': 16,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 6, 'relation': 'eq'},
'max_score': 3.5664783,
'hits': [{'_index': 'text',
'_type': '_doc',
'_id': 'id-1',
'_score': 3.5664783,
'_source': {'id': 'id-1',
'article': '1705aa',
'indexName': 'text',
'currentVersion': 0,
'userId': 2,
'indexedUtc': '2022-05-23T07:47:48.6175402+00:00'}},
{'_index': 'text',
'_type': '_doc',
'_id': 'id-2',
'_score': 1.3915253,
'_source': {'id': 'id-2',
'article': '1705aa',
'articleInfo': 'someInfo',
'userId': 2,
'indexedUtc': '2022-05-23T09:57:11.8080429+00:00'}},
...
}
and this explanation for the two hits (each explain tree is followed by the document it belongs to):
{'description': 'sum of:',
'details': [{'description': 'weight(article:1705aa in 220) '
'[PerFieldSimilarity], result of:',
'details': [{'description': 'score(freq=1.0), computed as boost '
'* idf * tf from:',
'details': [{'description': 'boost',
'details': [],
'value': 2.2},
{'description': 'idf, computed as log(1 '
'+ (N - n + 0.5) / (n + '
'0.5)) from:',
'details': [{'description': 'n, number '
'of '
'documents '
'containing '
'term',
'details': [],
'value': 1},
{'description': 'N, total '
'number of '
'documents '
'with '
'field',
'details': [],
'value': 37}],
'value': 3.232121},
{'description': 'tf, computed as freq / '
'(freq + k1 * (1 - b + '
'b * dl / avgdl)) from:',
'details': [{'description': 'freq, '
'occurrences '
'of term '
'within '
'document',
'details': [],
'value': 1.0},
{'description': 'k1, term '
'saturation '
'parameter',
'details': [],
'value': 1.2},
{'description': 'b, length '
'normalization '
'parameter',
'details': [],
'value': 0.75},
{'description': 'dl, '
'length of '
'field',
'details': [],
'value': 1.0},
{'description': 'avgdl, '
'average '
'length of '
'field',
'details': [],
'value': 1.2972972}],
'value': 0.50156736}],
'value': 3.5664783}],
'value': 3.5664783},
{'description': 'match on required clause, product of:',
'details': [{'description': '# clause',
'details': [],
'value': 0.0},
{'description': 'userId:[2 TO 2]',
'details': [],
'value': 1.0}],
'value': 0.0}],
'value': 3.5664783}
{'currentVersion': 0,
'id': 'id-1',
'indexName': 'text',
'indexedUtc': '2022-05-23T07:47:48.6175402+00:00',
'article': '1705aa',
'userId': 2}
{'description': 'sum of:',
'details': [{'description': 'weight(article:1705aa in 0) '
'[PerFieldSimilarity], result of:',
'details': [{'description': 'score(freq=1.0), computed as boost '
'* idf * tf from:',
'details': [{'description': 'boost',
'details': [],
'value': 2.2},
{'description': 'idf, computed as log(1 '
'+ (N - n + 0.5) / (n + '
'0.5)) from:',
'details': [{'description': 'n, number '
'of '
'documents '
'containing '
'term',
'details': [],
'value': 5},
{'description': 'N, total '
'number of '
'documents '
'with '
'field',
'details': [],
'value': 20}],
'value': 1.3397744},
{'description': 'tf, computed as freq / '
'(freq + k1 * (1 - b + '
'b * dl / avgdl)) from:',
'details': [{'description': 'freq, '
'occurrences '
'of term '
'within '
'document',
'details': [],
'value': 1.0},
{'description': 'k1, term '
'saturation '
'parameter',
'details': [],
'value': 1.2},
{'description': 'b, length '
'normalization '
'parameter',
'details': [],
'value': 0.75},
{'description': 'dl, '
'length of '
'field',
'details': [],
'value': 1.0},
{'description': 'avgdl, '
'average '
'length of '
'field',
'details': [],
'value': 1.1}],
'value': 0.472103}],
'value': 1.3915253}],
'value': 1.3915253},
{'description': 'match on required clause, product of:',
'details': [{'description': '# clause',
'details': [],
'value': 0.0},
{'description': 'userId:[2 TO 2]',
'details': [],
'value': 1.0}],
'value': 0.0}],
'value': 1.3915253}
{'currentVersion': 0,
'id': 'id-2',
'indexName': 'text',
'indexedUtc': '2022-05-23T09:57:11.8080429+00:00',
'article': ' 1705aa ',
'articleInfo': 'someInfo',
'userId': 2}
Why do I get a score of 3.5664783 for the first document and 1.3915253 for the second?
They're located in the same index and are both an exact match for the fuzzy query. The number of documents used in the explanation seems to differ between the two, and I don't understand why, or how to get equal scores for both documents.
Comments (1)
The response shows that you have 5 shards.
Shards affect relevance scoring. Your documents are distributed across the shards and, by default, Elasticsearch makes each shard responsible for producing its own scores from its own local term statistics.
Hence the values of N, n and avgdl used in the explanation differ between the two results: the two documents were scored on different shards.
Read more about the impact of shards on scoring here and here.
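To make this concrete, the two scores can be recomputed by hand from the numbers in the explain output. The sketch below applies the idf and tf formulas exactly as they are printed in the explanation. The function name bm25_term_score is only illustrative and is not part of any Elasticsearch client.

```python
import math

def bm25_term_score(boost, n, N, freq, dl, avgdl, k1=1.2, b=0.75):
    """Score a single term using the formulas shown in the explain output."""
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

# Values copied from the two explain trees in the question.
# Only n, N and avgdl differ; they are per-shard statistics.
score_id1 = bm25_term_score(boost=2.2, n=1, N=37, freq=1.0, dl=1.0, avgdl=1.2972972)
score_id2 = bm25_term_score(boost=2.2, n=5, N=20, freq=1.0, dl=1.0, avgdl=1.1)

print(score_id1, score_id2)
```

Both documents have the same freq, dl, boost, k1 and b, so the whole gap between roughly 3.57 and 1.39 comes from the shard-local values of n, N and avgdl.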
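By default, Elasticsearch uses a search type called Query Then Fetch. It works as follows:
1. The query is sent to each shard.
2. Each shard finds all matching documents and calculates scores using its local term/document frequencies.
3. Each shard builds a priority queue of results (sorting, pagination with from/size, and so on).
4. Metadata about the results is returned to the requesting node. The actual documents are not sent yet, just the scores.
5. Scores from all the shards are merged and sorted on the requesting node, and documents are selected according to the query criteria.
6. Finally, the actual documents are retrieved from the individual shards where they reside.
7. Results are returned to the client.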
To get more consistent scores, you can use DFS Query Then Fetch by setting search_type=dfs_query_then_fetch on your search request.
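As a rough sketch, the request could be built from Python with only the standard library. The index name text and the query body come from the question. The host http://localhost:9200 and the helper name build_dfs_search are assumptions made for illustration. The code only constructs the request, and the line that would send it to a cluster is left as a comment.

```python
import json
import urllib.parse
import urllib.request

def build_dfs_search(base_url, index, query):
    """Build (without sending) a _search request that uses dfs_query_then_fetch."""
    params = urllib.parse.urlencode({"search_type": "dfs_query_then_fetch"})
    url = f"{base_url}/{index}/_search?{params}"
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Query body taken from the question.
query = {
    "from": 0,
    "min_score": 0.6,
    "query": {
        "bool": {
            "filter": [{"term": {"userId": {"value": 2}}}],
            "should": {"match": {"article": {"fuzziness": "AUTO", "query": "1705aa"}}},
        }
    },
    "size": 20,
    "sort": [{"_score": {"order": "desc"}}],
    "explain": True,
}

# Assumed host; replace with the address of your own cluster.
request = build_dfs_search("http://localhost:9200", "text", query)
print(request.full_url)

# To run it against a live cluster:
# with urllib.request.urlopen(request) as resp:
#     hits = json.load(resp)["hits"]["hits"]
```

The only change compared with the original search is the search_type URL parameter; the query body stays the same.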
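DFS Query Then Fetch does the following:
1. Each shard is pre-queried for its term and document frequencies.
2. The query is sent to each shard.
3. Each shard finds all matching documents and calculates scores using the global term/document frequencies calculated from the prequery.
4. Each shard builds a priority queue of results (sorting, pagination with from/size, and so on).
5. Metadata about the results is returned to the requesting node. The actual documents are not sent yet, just the scores.
6. Scores from all the shards are merged and sorted on the requesting node, and documents are selected according to the query criteria.
7. Finally, the actual documents are retrieved from the individual shards where they reside.
8. Results are returned to the client.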
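The effect of the prequery can be illustrated with a toy calculation. The sketch below merges the statistics of the two shards that appear in the explain output and scores both documents with the merged values. It is purely illustrative: the index has five shards and the statistics of the other three are unknown, so the resulting number is not what the real cluster would return, and ShardStats and merge are hypothetical helpers, not Elasticsearch internals.

```python
import math
from dataclasses import dataclass

@dataclass
class ShardStats:
    doc_count: int        # N: documents with the field
    doc_freq: int         # n: documents containing the term
    avg_field_len: float  # avgdl: average field length

def bm25(boost, stats, freq, dl, k1=1.2, b=0.75):
    """BM25 term score using the formulas from the explain output."""
    idf = math.log(1 + (stats.doc_count - stats.doc_freq + 0.5) / (stats.doc_freq + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / stats.avg_field_len))
    return boost * idf * tf

def merge(shards):
    """Combine per-shard statistics into one shared set (conceptual DFS prequery)."""
    total_docs = sum(s.doc_count for s in shards)
    total_freq = sum(s.doc_freq for s in shards)
    total_len = sum(s.doc_count * s.avg_field_len for s in shards)
    return ShardStats(total_docs, total_freq, total_len / total_docs)

# Statistics of the two shards seen in the question's explain output.
shard_a = ShardStats(doc_count=37, doc_freq=1, avg_field_len=1.2972972)
shard_b = ShardStats(doc_count=20, doc_freq=5, avg_field_len=1.1)

# Query Then Fetch: each document is scored with its own shard's statistics.
local_id1 = bm25(2.2, shard_a, freq=1.0, dl=1.0)
local_id2 = bm25(2.2, shard_b, freq=1.0, dl=1.0)

# DFS Query Then Fetch: both documents are scored with the same merged statistics.
global_stats = merge([shard_a, shard_b])
dfs_id1 = bm25(2.2, global_stats, freq=1.0, dl=1.0)
dfs_id2 = bm25(2.2, global_stats, freq=1.0, dl=1.0)

print(local_id1, local_id2)
print(dfs_id1, dfs_id2)
```

With shared statistics the two documents, which have the same term frequency and field length, receive the same score, which is the behaviour the question asks for.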