Why does an Elasticsearch synonym filter change start_offset?

Posted 09-12 03:29 · 3403 characters · 23 views · 0 comments

The synonym rule I configured is:

托尼-克罗斯=>托尼-克罗斯,克罗斯,托尼克罗斯,托尼,tk
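
To make the rule easier to read: in the Solr-style synonym format, a line containing `=>` is an explicit mapping, where text matching the left side is replaced by the comma-separated alternatives on the right. The following is only a rough sketch of how the line splits; `parse_explicit_mapping` is a hypothetical helper, not Elasticsearch's own parser.

```python
RULE = "托尼-克罗斯=>托尼-克罗斯,克罗斯,托尼克罗斯,托尼,tk"


def parse_explicit_mapping(rule):
    """Split a Solr-style explicit mapping into (inputs, outputs).

    Hypothetical helper for illustration only; it does no analysis
    of the individual entries.
    """
    lhs, sep, rhs = rule.partition("=>")
    if not sep:
        raise ValueError("not an explicit mapping (no '=>')")
    inputs = [s.strip() for s in lhs.split(",") if s.strip()]
    outputs = [s.strip() for s in rhs.split(",") if s.strip()]
    return inputs, outputs


inputs, outputs = parse_explicit_mapping(RULE)
print("matches:", inputs)
print("replaced by:", outputs)
```

So the single input 托尼-克罗斯 maps to five alternatives, one of which is the input itself.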

The index settings are:

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms_path": "my_synonym.txt",
            "lenient": "true"
          }
        },
        "analyzer": {
          "my_ik_analyzer": {
            "filter": [
              "my_synonym"
            ],
            "type": "custom",
            "tokenizer": "my_ik_token"
          }
        },
        "tokenizer": {
          "my_ik_token": {
            "type": "ik_max_word"
          }
        }
      }
    }
  }
}

The tokenizer alone (my_ik_token) analyzes 托尼-克罗斯 into:

{  
    "tokens":[  
        {  
            "token":"托尼",  
            "start_offset":0,  
            "end_offset":2,  
            "type":"CN_WORD",  
            "position":0  
        },  
        {  
            "token":"克罗斯",  
            "start_offset":3,  
            "end_offset":6,  
            "type":"CN_WORD",  
            "position":1  
        },  
        {  
            "token":"罗斯",  
            "start_offset":4,  
            "end_offset":6,  
            "type":"CN_WORD",  
            "position":2  
        }  
    ]  
}
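
For anyone who wants to reproduce this, a token list like the one above can be requested from the index-level `_analyze` API. Below is a minimal sketch using only the Python standard library. The node address `http://localhost:9200` and the index name `my_index` are assumptions (the post does not give them), and `build_analyze_request` / `run_analyze` are hypothetical helper names.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local node
INDEX = "my_index"                # hypothetical index created with the settings above


def build_analyze_request(text, analyzer=None, tokenizer=None):
    """Return (url, body) for POST /<index>/_analyze."""
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("pass exactly one of analyzer or tokenizer")
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    else:
        body["tokenizer"] = tokenizer
    return f"{ES_URL}/{INDEX}/_analyze", body


def run_analyze(url, body):
    """Send the request and return the parsed JSON response (needs a running node)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Tokenizer only, without the synonym filter.
tokenizer_request = build_analyze_request("托尼-克罗斯", tokenizer="my_ik_token")
# Full analyzer, including the my_synonym filter.
analyzer_request = build_analyze_request("托尼-克罗斯", analyzer="my_ik_analyzer")

print(json.dumps(tokenizer_request[1], ensure_ascii=False))
print(json.dumps(analyzer_request[1], ensure_ascii=False))
```

Passing either tuple to `run_analyze(*tokenizer_request)` should return the `tokens` array, provided the node is running and the IK plugin is installed.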

With the synonym filter added, the analyzer (my_ik_analyzer) produces:

{
    "tokens": [
        {
            "token": "托尼",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "克罗斯",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "托尼",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "托尼",
            "start_offset": 0,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "tk",
            "start_offset": 0,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "克罗斯",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "罗斯",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "尼克",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "罗斯",
            "start_offset": 4,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "克罗斯",
            "start_offset": 4,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "罗斯",
            "start_offset": 4,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 3
        }
    ]
}

As you can see, 克罗斯 appears twice, and for one of those occurrences the start_offset and end_offset are wrong.
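
One way to spot the suspicious entries mechanically is to compare each token with the slice of the original text that its offsets cover. The sketch below does that for the analyzer output above; `misplaced_tokens` is a hypothetical helper. It only looks at tokens that occur verbatim in the input, since injected synonyms such as tk never match the source text. It also assumes every character is in the Basic Multilingual Plane, so that Python string indices line up with the reported offsets. A flagged token is only a candidate for inspection, not proof of a bug.

```python
TEXT = "托尼-克罗斯"

# (token, start_offset, end_offset, position), copied from the analyzer output above
TOKENS = [
    ("托尼", 0, 2, 0),
    ("克罗斯", 0, 2, 0),
    ("托尼", 0, 2, 0),
    ("托尼", 0, 6, 0),
    ("tk", 0, 6, 0),
    ("克罗斯", 3, 6, 1),
    ("罗斯", 3, 6, 1),
    ("尼克", 3, 6, 1),
    ("罗斯", 4, 6, 2),
    ("克罗斯", 4, 6, 2),
    ("罗斯", 4, 6, 3),
]


def misplaced_tokens(text, tokens):
    """Return tokens that occur in text but whose offsets cover a different substring."""
    flagged = []
    for token, start, end, _position in tokens:
        covered = text[start:end]
        if token in text and covered != token:
            flagged.append((token, start, end, covered))
    return flagged


for token, start, end, covered in misplaced_tokens(TEXT, TOKENS):
    print(f"{token!r} reported at {start}-{end}, which covers {covered!r}")
```

On this data the check flags 克罗斯 at 0-2 and at 4-6, along with 托尼 at 0-6 and 罗斯 at 3-6, while 克罗斯 at 3-6 passes.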
