
Why does an Elasticsearch synonym filter change start_offset?

Posted on 09-12 03:29

The synonym rule in my_synonym.txt is:

托尼-克罗斯=>托尼-克罗斯,克罗斯,托尼克罗斯,托尼,tk
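
For readers unfamiliar with the format: the rule uses the Solr-style explicit mapping syntax, where the term on the left of `=>` is replaced by the comma-separated terms on the right. Below is a minimal Python sketch of how such a line breaks down. The helper `parse_synonym_rule` is made up for illustration and is not part of Elasticsearch. As far as I know, Elasticsearch also analyzes every term of the rule before building the synonym map, which this sketch does not try to reproduce.

```python
def parse_synonym_rule(line):
    """Split a Solr-style synonym line into (inputs, outputs).

    Hypothetical helper for illustration only; it splits on the
    separators and does no tokenization of the individual terms.
    """
    if "=>" in line:
        # Explicit mapping: terms on the left are replaced by terms on the right.
        left, right = line.split("=>", 1)
        inputs = [t.strip() for t in left.split(",") if t.strip()]
        outputs = [t.strip() for t in right.split(",") if t.strip()]
    else:
        # No "=>": the comma-separated terms are treated as equivalent.
        inputs = outputs = [t.strip() for t in line.split(",") if t.strip()]
    return inputs, outputs


rule = "托尼-克罗斯=>托尼-克罗斯,克罗斯,托尼克罗斯,托尼,tk"
inputs, outputs = parse_synonym_rule(rule)
print(inputs)   # one input term
print(outputs)  # five replacement terms
```

So a single input term is mapped to five outputs, two of which (托尼-克罗斯 and 托尼克罗斯) are long enough that the tokenizer may split them further.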

The index settings are:

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms_path": "my_synonym.txt",
            "lenient": "true"
          }
        },
        "analyzer": {
          "my_ik_analyzer": {
            "filter": [
              "my_synonym"
            ],
            "type": "custom",
            "tokenizer": "my_ik_token"
          }
        },
        "tokenizer": {
          "my_ik_token": {
            "type": "ik_max_word"
          }
        }
      }
    }
  }
}
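
To reproduce the two results shown in this post, the index-level `_analyze` API can be called once with the tokenizer alone and once with the full analyzer. The following is a sketch using only the Python standard library. The base URL `http://localhost:9200` and the index name `my_index` are assumptions, since the post names neither.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, body):
    """Build (but do not send) a POST request for <index>/_analyze."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, index, body):
    """Send the request and return the list of token dicts from the response."""
    request = build_analyze_request(base_url, index, body)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["tokens"]


TEXT = "托尼-克罗斯"
# Tokenizer only, without the synonym filter.
tokenizer_body = {"tokenizer": "my_ik_token", "text": TEXT}
# Full custom analyzer, tokenizer plus my_synonym.
analyzer_body = {"analyzer": "my_ik_analyzer", "text": TEXT}
```

With a running cluster, `analyze("http://localhost:9200", "my_index", tokenizer_body)` and the same call with `analyzer_body` should return token lists in the shape shown in this post.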

Running the tokenizer alone (my_ik_token) on 托尼-克罗斯 gives:

{  
    "tokens":[  
        {  
            "token":"托尼",  
            "start_offset":0,  
            "end_offset":2,  
            "type":"CN_WORD",  
            "position":0  
        },  
        {  
            "token":"克罗斯",  
            "start_offset":3,  
            "end_offset":6,  
            "type":"CN_WORD",  
            "position":1  
        },  
        {  
            "token":"罗斯",  
            "start_offset":4,  
            "end_offset":6,  
            "type":"CN_WORD",  
            "position":2  
        }  
    ]  
}
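
A quick way to confirm that these tokenizer offsets are sane is to slice the original text with each token's offsets and compare the slice to the token. This is a small sketch; `find_offset_mismatches` is a hypothetical helper, and it assumes all characters are in the Basic Multilingual Plane so that Python string indices line up with the offsets Elasticsearch reports.

```python
def find_offset_mismatches(text, tokens):
    """Return the tokens whose [start_offset, end_offset) slice is not the token text."""
    return [
        t for t in tokens
        if text[t["start_offset"]:t["end_offset"]] != t["token"]
    ]


text = "托尼-克罗斯"
# Tokenizer output copied from the post (type field omitted).
tokenizer_tokens = [
    {"token": "托尼", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "克罗斯", "start_offset": 3, "end_offset": 6, "position": 1},
    {"token": "罗斯", "start_offset": 4, "end_offset": 6, "position": 2},
]

mismatches = find_offset_mismatches(text, tokenizer_tokens)
print(mismatches)  # [] : every token matches its slice of the text
```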

Running the analyzer that adds the synonym filter (my_ik_analyzer) on the same text gives:

{
    "tokens": [
        {
            "token": "托尼",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "克罗斯",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "托尼",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "托尼",
            "start_offset": 0,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "tk",
            "start_offset": 0,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "克罗斯",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "罗斯",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "尼克",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "罗斯",
            "start_offset": 4,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "克罗斯",
            "start_offset": 4,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "罗斯",
            "start_offset": 4,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 3
        }
    ]
}

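As the output shows, 克罗斯 appears more than once, and its start_offset and end_offset are wrong in some occurrences (for example 0 and 2, where the tokenizer alone reported 3 and 6).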
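
To make the inconsistency easier to see, the analyzer output can be grouped by term, listing every distinct offset span each term was given. The sketch below uses the token list copied from the output above, reduced to (token, start_offset, end_offset, position) tuples; `spans_by_term` is a hypothetical helper name.

```python
from collections import defaultdict


def spans_by_term(tokens):
    """Map each term to the set of (start_offset, end_offset) spans it appears with."""
    spans = defaultdict(set)
    for term, start, end, _position in tokens:
        spans[term].add((start, end))
    return spans


# (token, start_offset, end_offset, position), copied from the analyzer output.
analyzer_tokens = [
    ("托尼", 0, 2, 0), ("克罗斯", 0, 2, 0), ("托尼", 0, 2, 0),
    ("托尼", 0, 6, 0), ("tk", 0, 6, 0),
    ("克罗斯", 3, 6, 1), ("罗斯", 3, 6, 1), ("尼克", 3, 6, 1),
    ("罗斯", 4, 6, 2), ("克罗斯", 4, 6, 2),
    ("罗斯", 4, 6, 3),
]

# Keep only the terms that were emitted with more than one distinct span.
conflicts = {
    term: sorted(spans)
    for term, spans in spans_by_term(analyzer_tokens).items()
    if len(spans) > 1
}
for term, spans in conflicts.items():
    print(term, spans)
```

In this output 克罗斯 carries three different spans, (0, 2), (3, 6) and (4, 6), while the tokenizer alone only ever reported (3, 6) for it.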


Author: 酒浓于脸红
