Stability of index/search algorithms between versions

Posted on 2025-01-19 09:38:14

I'm migrating from Elasticsearch 1.5 to 7.10. There are multiple required changes; the most relevant one is the removal of the document type concept in version 6. To deal with it, I introduced a new field, doc_type, and I match against it when I search.
My question is: when I run the same search query (or an equivalent one, because some things changed), should I expect to get exactly the same result set? I'm seeing some differences, so I would like to figure out whether I broke something in the new mappings or in the search query.
Thank you in advance.

Edit after first question:

In general: I have a service that communicates with ES 1.5, and I have to migrate it to ES 7.10 while keeping the external API as stable as possible.

  • I'm not using scoring.
  • Previously I had document types A and B, and I queried them with, for example, host/indexname/A,B/_search. After the migration I store A or B in doc_type, and the query becomes host/indexname/_search with "bool":{"should":[{"terms":{"doc_type":["A"],"boost":1.0}},{"terms":{"doc_type":["B"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0} in the body. If I put A and B in separate indexes and the user wants to match both, I would have to "merge" the search responses of the two queries, and I don't know which strategy to follow for that. By keeping everything together, I get a single response with mixed (doc_type) results from ES. I followed this specific approach: https://www.elastic.co/blog/removal-of-mapping-types-elasticsearch#custom-type-field
  • The differences are not large, and it is difficult to show a concrete example because the data/document structure is complex. The idea is that, for a given query, 1.5 returns for example:
    [a, b, c, d, e, f, g, h, i, j] (where each hit may be of type A or B)
    With 7.10 I get responses like:
    [a, b, e, c, d, f, g, h, i, j] or [a, b, c, d, e, g, i, j, k]
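The two example responses above differ in different ways: the first contains the same hits in another order, while the second is missing hits and has an extra one. A minimal sketch, in Python, of how the two cases could be told apart. The function name `compare_hits` and the use of plain ID lists are assumptions for illustration; the IDs would have to be extracted from the real responses.

```python
def compare_hits(old_ids, new_ids):
    """Classify how two ordered lists of hit IDs differ.

    old_ids: IDs returned by the old cluster, in response order.
    new_ids: IDs returned by the new cluster, in response order.
    """
    old_set, new_set = set(old_ids), set(new_ids)
    # Hits present in only one of the two responses.
    missing = [i for i in old_ids if i not in new_set]
    extra = [i for i in new_ids if i not in old_set]
    # Compare the relative order of the hits both responses share.
    common_old = [i for i in old_ids if i in new_set]
    common_new = [i for i in new_ids if i in old_set]
    return {
        "missing": missing,
        "extra": extra,
        "reordered": common_old != common_new,
    }


old = list("abcdefghij")
print(compare_hits(old, list("abecdfghij")))  # same hits, different order
print(compare_hits(old, list("abcdegijk")))   # different membership
```

A pure reordering points at sorting (or scoring), whereas missing or extra hits point at the mappings or the filter clauses.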

Second edit:
This query was generated by the Java client.

{
   "from":0,
   "size":100,
   "query":{
      "bool":{
         "must":[
            {
               "query_string":{
                  "query":"mark_deleted:false",
                  "fields":[
                     
                  ],
                  "type":"best_fields",
                  "default_operator":"or",
                  "max_determinized_states":10000,
                  "enable_position_increments":true,
                  "fuzziness":"AUTO",
                  "fuzzy_prefix_length":0,
                  "fuzzy_max_expansions":50,
                  "phrase_slop":0,
                  "escape":false,
                  "auto_generate_synonyms_phrase_query":true,
                  "fuzzy_transpositions":true,
                  "boost":1.0
               }
            },
            {
               "bool":{
                  "should":[
                     {
                        "terms":{
                           "type":[
                              "A"
                           ],
                           "boost":1.0
                        }
                     },
                     {
                        "terms":{
                           "type":[
                              "B"
                           ],
                           "boost":1.0
                        }
                     },
                     {
                        "terms":{
                           "type":[
                              "D"
                           ],
                           "boost":1.0
                        }
                     }
                  ],
                  "adjust_pure_negative":true,
                  "boost":1.0
               }
            }
         ],
         "adjust_pure_negative":true,
         "boost":1.0
      }
   },
   "post_filter":{
      "term":{
         "mark_deleted":{
            "value":false,
            "boost":1.0
         }
      }
   },
   "sort":[
      {
         "a_specific_date":{
            "order":"desc"
         }
      }
   ],
   "highlight":{
      "pre_tags":[
         "<b>"
      ],
      "post_tags":[
         "</b>"
      ],
      "no_match_size":120,
      "fields":{
         "body":{
            "fragment_size":120,
            "number_of_fragments":1
         }
      }
   }
}

Comments (1)

回眸一遍 2025-01-26 09:38:14

First, since you don't care about scoring, you should use bool/filter instead of bool/must at the top level. Otherwise your results are sorted by _score by default, and between 1.5 and 7.10 there have been so many changes that this would explain the differences you get. So you're better off simply sorting the results by any field other than _score.
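A related point when sorting on a field such as a_specific_date: Elasticsearch does not guarantee the relative order of documents that share the same sort value, so a pure reordering like [a, b, e, c, d, ...] may simply come from ties. A minimal sketch, in Python, of appending a tiebreaker to the sort list. The field name `doc_id` is hypothetical (a unique, sortable field would have to exist in the mapping), and `with_tiebreaker` is only an illustrative helper.

```python
def with_tiebreaker(sort, tiebreak_field="doc_id"):
    """Return a copy of an Elasticsearch sort list with a final tiebreaker.

    `doc_id` is a hypothetical unique keyword field, not part of the
    mapping shown in the question.
    """
    def field_name(clause):
        # A sort clause is either a bare field name or {field: options}.
        return clause if isinstance(clause, str) else next(iter(clause))

    if any(field_name(c) == tiebreak_field for c in sort):
        return list(sort)  # already present, nothing to add
    return list(sort) + [{tiebreak_field: {"order": "asc"}}]


sort = [{"a_specific_date": {"order": "desc"}}]
print(with_tiebreaker(sort))
```

With the extra clause, documents with equal dates come back in a repeatable order, which makes it easier to compare responses between the two clusters.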

Second, instead of the bool/should on type you can use a simple terms query, which does exactly the same job, yet in a simpler way:

{
  "from": 0,
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "mark_deleted:false",
            "fields": [],
            "type": "best_fields",
            "default_operator": "or",
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_transpositions": true,
            "boost": 1
          }
        },
        {
          "terms": {
            "type": [
              "A",
              "B",
              "D"
            ]
          }
        }
      ]
    }
  },
  "post_filter": {
    "term": {
      "mark_deleted": {
        "value": false,
        "boost": 1
      }
    }
  },
  "sort": [
    {
      "a_specific_date": {
        "order": "desc"
      }
    }
  ],
  "highlight": {
    "pre_tags": [
      "<b>"
    ],
    "post_tags": [
      "</b>"
    ],
    "no_match_size": 120,
    "fields": {
      "body": {
        "fragment_size": 120,
        "number_of_fragments": 1
      }
    }
  }
}
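The rewrite from several should clauses to a single terms query can also be done mechanically. The following is a minimal Python sketch operating on the query as a plain dict; `merge_terms_should` is a hypothetical helper, not part of any Elasticsearch client, and it only handles the exact shape shown in the question (a bool with nothing but should clauses, each a terms query on the same field).

```python
IGNORED_BOOL_KEYS = {"should", "adjust_pure_negative", "boost"}


def merge_terms_should(query):
    """Collapse a bool/should of `terms` clauses on one field into one `terms`.

    Returns None when the query does not have that exact shape.
    Per-clause boosts are dropped, which is harmless only when scoring
    is not used (for example in filter context).
    """
    body = query.get("bool")
    if body is None or set(body) - IGNORED_BOOL_KEYS:
        return None  # must/filter/must_not present: not equivalent
    field, values = None, []
    for clause in body.get("should", []):
        terms = clause.get("terms") if len(clause) == 1 else None
        if terms is None:
            return None
        names = [k for k in terms if k != "boost"]
        if len(names) != 1 or field not in (None, names[0]):
            return None  # several fields: cannot merge
        field = names[0]
        for value in terms[field]:
            if value not in values:
                values.append(value)
    return {"terms": {field: values}} if field else None


should = {"bool": {"should": [
    {"terms": {"type": ["A"], "boost": 1.0}},
    {"terms": {"type": ["B"], "boost": 1.0}},
    {"terms": {"type": ["D"], "boost": 1.0}},
], "adjust_pure_negative": True, "boost": 1.0}}
print(merge_terms_should(should))
```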

Finally, I'm not sure why you're using a query_string query to do an exact match on mark_deleted:false; it doesn't make sense to me. A simple term query would be simpler and more appropriate here.

It's also not clear why you filter on mark_deleted:false again in your post_filter, since it's the same condition as in your query_string constraint.
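Putting the suggestions above together, here is a minimal sketch in Python that builds the simplified request body as a dict. It assumes that mark_deleted is mapped as a boolean and that type is the field holding the former document type (both taken from the query in the question); `build_query` and its parameters are hypothetical names used only for illustration.

```python
import json


def build_query(types, size=100, offset=0):
    """Build a search body that filters instead of scoring.

    Assumes `mark_deleted` is a boolean field and `type` holds the
    former document type, as in the query from the question.
    """
    return {
        "from": offset,
        "size": size,
        "query": {
            "bool": {
                # Filter context: no scoring, clauses are cacheable.
                "filter": [
                    {"term": {"mark_deleted": False}},
                    {"terms": {"type": list(types)}},
                ]
            }
        },
        # No post_filter: it would repeat the mark_deleted condition.
        "sort": [{"a_specific_date": {"order": "desc"}}],
        "highlight": {
            "pre_tags": ["<b>"],
            "post_tags": ["</b>"],
            "no_match_size": 120,
            "fields": {"body": {"fragment_size": 120, "number_of_fragments": 1}},
        },
    }


print(json.dumps(build_query(["A", "B", "D"]), indent=2))
```

The printed JSON can be sent as the body of host/indexname/_search and compared with the response of the original query.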
