Why does an Elasticsearch synonym filter change start_offset?
The synonym rule is configured as follows:
托尼-克罗斯=>托尼-克罗斯,克罗斯,托尼克罗斯,托尼,tk
The index settings are as follows:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms_path": "my_synonym.txt",
            "lenient": "true"
          }
        },
        "analyzer": {
          "my_ik_analyzer": {
            "filter": [
              "my_synonym"
            ],
            "type": "custom",
            "tokenizer": "my_ik_token"
          }
        },
        "tokenizer": {
          "my_ik_token": {
            "type": "ik_max_word"
          }
        }
      }
    }
  }
}
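For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two analyses could be requested through the `_analyze` API. The cluster URL `http://localhost:9200` and the index name `my_index` are assumptions (the post does not name the index), and the `analyze` helper is only illustrative; it is defined but not called here, so the snippet runs without a cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local cluster
INDEX = "my_index"                # hypothetical index created with the settings above
TEXT = "托尼-克罗斯"


def build_analyze_body(text, analyzer=None, tokenizer=None):
    """Build an _analyze request body for either an analyzer or a bare tokenizer."""
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("pass exactly one of analyzer or tokenizer")
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    else:
        body["tokenizer"] = tokenizer
    return body


def analyze(body, es_url=ES_URL, index=INDEX):
    """POST the body to <index>/_analyze and return the list of tokens."""
    request = urllib.request.Request(
        f"{es_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["tokens"]


# The two request bodies that correspond to the two results in this post.
tokenizer_body = build_analyze_body(TEXT, tokenizer="my_ik_token")
analyzer_body = build_analyze_body(TEXT, analyzer="my_ik_analyzer")
```

With a running cluster, `analyze(tokenizer_body)` and `analyze(analyzer_body)` should return token lists that can be compared side by side.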
Tokenizing 托尼-克罗斯 with the tokenizer alone (my_ik_token) gives:
{
"tokens":[
{
"token":"托尼",
"start_offset":0,
"end_offset":2,
"type":"CN_WORD",
"position":0
},
{
"token":"克罗斯",
"start_offset":3,
"end_offset":6,
"type":"CN_WORD",
"position":1
},
{
"token":"罗斯",
"start_offset":4,
"end_offset":6,
"type":"CN_WORD",
"position":2
}
]
}
With the analyzer that adds the synonym filter (my_ik_analyzer), the result is:
{
"tokens": [
{
"token": "托尼",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "克罗斯",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "托尼",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "托尼",
"start_offset": 0,
"end_offset": 6,
"type": "SYNONYM",
"position": 0
},
{
"token": "tk",
"start_offset": 0,
"end_offset": 6,
"type": "SYNONYM",
"position": 0
},
{
"token": "克罗斯",
"start_offset": 3,
"end_offset": 6,
"type": "SYNONYM",
"position": 1
},
{
"token": "罗斯",
"start_offset": 3,
"end_offset": 6,
"type": "SYNONYM",
"position": 1
},
{
"token": "尼克",
"start_offset": 3,
"end_offset": 6,
"type": "SYNONYM",
"position": 1
},
{
"token": "罗斯",
"start_offset": 4,
"end_offset": 6,
"type": "SYNONYM",
"position": 2
},
{
"token": "克罗斯",
"start_offset": 4,
"end_offset": 6,
"type": "SYNONYM",
"position": 2
},
{
"token": "罗斯",
"start_offset": 4,
"end_offset": 6,
"type": "SYNONYM",
"position": 3
}
]
}
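As the output shows, 克罗斯 appears more than once, and in one occurrence (position 0) the start_offset and end_offset are 0 and 2 instead of the 3 and 6 that the tokenizer reports. Why does this happen?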
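To pin down which tokens are affected, a small check like the following may help. It is only a heuristic sketch: `suspicious_offsets` is a hypothetical helper, and it flags tokens whose text occurs in the input but whose offsets do not point at that text. A synonym filter may give injected tokens the offsets of the span it matched, so a flagged token is a place to look, not proof of a bug.

```python
def suspicious_offsets(text, tokens):
    """Return (token, start, end) for tokens whose offsets do not slice out the token.

    Tokens that never occur in the input (pure synonyms such as "tk") are skipped,
    because no offset pair could reproduce them.
    Assumption: the text contains no characters outside the BMP, so Python string
    indices line up with the offsets Elasticsearch reports.
    """
    flagged = []
    for tok in tokens:
        term = tok["token"]
        start, end = tok["start_offset"], tok["end_offset"]
        if term in text and text[start:end] != term:
            flagged.append((term, start, end))
    return flagged


TEXT = "托尼-克罗斯"

# A few tokens copied from the my_ik_analyzer output above.
sample = [
    {"token": "托尼", "start_offset": 0, "end_offset": 2},
    {"token": "克罗斯", "start_offset": 0, "end_offset": 2},
    {"token": "tk", "start_offset": 0, "end_offset": 6},
    {"token": "克罗斯", "start_offset": 3, "end_offset": 6},
]

flagged = suspicious_offsets(TEXT, sample)
print(flagged)  # [('克罗斯', 0, 2)]
```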