Elasticsearch char_filter not affecting search

Posted on 2025-01-19 14:31:13

My understanding of how the char_filter works must be wrong. My goal here is to treat all apostrophes and quote-like characters the same (in this case, remove them entirely) in Elasticsearch. (Apparently there are like 5 apostrophe-like Unicode characters... and my database has all versions :facepalm:)

Aside: This approach to the solution was inspired by this thread

So here is a toy problem that illustrates my issue.
I create an index with the char_filter, and then populate it with 3 documents:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "quote_analyzer": {
          "char_filter": [
            "quotes"
          ],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "quotes": {
          "mappings": [
            "\u0091=>",
            "\u0092=>",
            "\u2018=>",
            "\u2019=>"
          ],
          "type": "mapping"
        }
      }
    }
  }
}

POST test/_doc
{
  "name": "The King’s men",
  "id": "1"
}

POST test/_doc
{
  "name": "Zoom LeBron the Soldier 7 'King's Pride'",
  "id": "2"
}

POST test/_doc
{
  "name": "Kings Kings Kings",
  "id": "3"
}

As you can see, each document contains some form of the word Kings. I then check that my analyzer is doing what I think it should be doing:

GET test/_analyze
{
  "analyzer": "quote_analyzer",
  "text": "King’s boat"
}

Which yields:

{
  "tokens" : [
    {
      "token" : "Kings",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "boat",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

It appears that the apostrophe in King’s has been removed and the token is Kings. Great! So now I want to search for King’s, and since the analyzer removes the apostrophe, I should get all three results. Or at the very least I should get id:3, since with the apostrophe removed the query term would match Kings Kings Kings. However, searching for:

GET test/_search 
{
  "query": {
    "match": {
      "name": "King’s boat"
    }
  }
}

Yields:

{
  "took" : 1,
  // collapsing ....
  "hits" : {
     // collapsing ....
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1e2x_38Bn0QWlup8OIvp",
        "_score" : 1.1220688,
        "_source" : {
          "name" : "The King’s men",
          "id" : "1"
        }
      }
    ]
  }
}

Similarly, searching Kings boat only retrieves id:3. And searching King's boat only retrieves id:2.

What am I missing? How do I accomplish the goal of treating all apostrophe characters the same?

Comments (1)

书间行客 2025-01-26 14:31:13

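Please modify your char_filter so that it handles apostrophes as well as quotes, the same way you already handle quotes. The current mappings cover \u0091, \u0092, \u2018 and \u2019, but not the plain ASCII apostrophe (').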
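
One more thing worth checking alongside this: the index in the question defines quote_analyzer under settings but never assigns it to a field, so name falls back to the default standard analyzer at both index time and search time. That would explain why _analyze with an explicit analyzer looks right while match queries still treat each apostrophe variant as a different term. Below is a minimal sketch of a revised index, assuming Elasticsearch 7 or later (typeless mappings). The added ' mapping, the lowercase filter, and the mappings block are assumptions about what the asker wants and are not in the original post. Changing a field's analyzer means recreating the index and re-indexing the documents.

```
# Sketch only: drops the toy index from the question so it can be recreated.
DELETE test

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quotes": {
          "type": "mapping",
          "mappings": [
            "'=>",
            "\u0091=>",
            "\u0092=>",
            "\u2018=>",
            "\u2019=>"
          ]
        }
      },
      "analyzer": {
        "quote_analyzer": {
          "type": "custom",
          "char_filter": ["quotes"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "quote_analyzer"
      }
    }
  }
}
```

After re-posting the three documents, analyzing by field rather than by analyzer name shows what the name field actually receives, and the original search can be repeated:

```
# Uses the analyzer mapped to the field, not one named explicitly.
GET test/_analyze
{
  "field": "name",
  "text": "Zoom LeBron the Soldier 7 'King's Pride'"
}

GET test/_search
{
  "query": {
    "match": {
      "name": "King’s boat"
    }
  }
}
```

If the mapping took effect, the _analyze call should return kings with no apostrophe, and the search should be able to match all three documents on that term.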
