How to search for a part of a word with ElasticSearch

I've recently started using ElasticSearch and I can't seem to make it search for a part of a word.

Example: I have three documents from my couchdb indexed in ElasticSearch:

{
  "_id" : "1",
  "name" : "John Doeman",
  "function" : "Janitor"
}
{
  "_id" : "2",
  "name" : "Jane Doewoman",
  "function" : "Teacher"
}
{
  "_id" : "3",
  "name" : "Jimmy Jackal",
  "function" : "Student"
} 

So now, I want to search for all documents containing "Doe"

curl http://localhost:9200/my_idx/my_type/_search?q=Doe

That doesn't return any hits. But if I search for

curl http://localhost:9200/my_idx/my_type/_search?q=Doeman

It does return one document (John Doeman).

I've tried setting different analyzers and different filters as properties of my index. I've also tried using a full-blown query (for example:

{
  "query": {
    "term": {
      "name": "Doe"
    }
  }
}

)
But nothing seems to work.

How can I make ElasticSearch find both John Doeman and Jane Doewoman when I search for "Doe" ?

UPDATE

I tried to use the nGram tokenizer and filter, like Igor proposed, like this:

{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "bulk_size": "100",
    "bulk_timeout": "10ms",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "my_ngram_filter"
          ]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      }
    }
  }
}

The problem I'm having now is that each and every query returns ALL documents.
Any pointers? ElasticSearch documentation on using nGram isn't great...

哆兒滾 2024-11-23 18:34:14

I'm using nGram, too. I use standard tokenizer and nGram just as a filter. Here is my setup:

{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "analysis": {
      "index_analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "mynGram"
          ]
        }
      },
      "search_analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "mynGram"
          ]
        }
      },
      "filter": {
        "mynGram": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  }
}

This lets you find word parts up to 50 letters long. Adjust max_gram as you need. In German, words can get really long, so I set it to a high value.
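
A minimal sketch of how these analysis settings might be applied on a recent Elasticsearch version, wiring the custom analyzer to the name field from the question (the index and field names are assumptions; note that the filter type is spelled ngram in current versions, and index.max_ngram_diff must be at least max_gram - min_gram):

curl -X PUT "localhost:9200/my_idx" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.max_ngram_diff": 48,
    "analysis": {
      "filter": {
        "mynGram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 50
        }
      },
      "analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "mynGram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_index_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}'

Keeping the plain standard analyzer at search time (rather than applying the ngram filter there too) avoids the "every query returns all documents" problem from the question, because the search terms are not themselves shredded into tiny fragments.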

弱骨蛰伏 2024-11-23 18:34:14

I don't think there is any need to change the mapping.
Try query_string; it works perfectly. All of the following scenarios work with the default standard analyzer:

We have data:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

Scenario 1:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*Doe*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

Scenario 2:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*Jan*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}

Scenario 3:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*oh* *oe*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

EDIT:
The same implementation with Spring Data Elasticsearch:
https://stackoverflow.com/a/43579948/2357869

One more explanation of how query_string compares to the alternatives:
https://stackoverflow.com/a/43321606/2357869
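
For completeness, Scenario 1 might be run with curl like this (index name taken from the question; the URL path varies by Elasticsearch version):

curl -X POST "localhost:9200/my_idx/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "*Doe*"
    }
  }
}'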

紫竹語嫣☆ 2024-11-23 18:34:14

Searching with leading and trailing wildcards is going to be extremely slow on a large index. If you want to be able to search by word prefix, remove the leading wildcard. If you really need to find a substring in the middle of a word, you are better off using the ngram tokenizer.
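
For example, a prefix-only variant of the query_string search from the previous answer, which avoids the costly leading wildcard (a sketch; field name taken from the question):

{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Doe*"
    }
  }
}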

怪我入戏太深 2024-11-23 18:34:14

Without changing your index mappings, you could use a simple prefix query, which will do partial (prefix) searches like you are hoping for, e.g.:

{
  "query": { 
    "prefix" : { "name" : "Doe" }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html
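
One caveat worth adding: prefix is a term-level query, so the query text is not analyzed. If name was indexed with the standard analyzer (which lowercases tokens), the lowercase form is what will actually match, e.g.:

{
  "query": {
    "prefix": { "name": "doe" }
  }
}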

伪心 2024-11-23 18:34:14

While there are a lot of answers that focus on solving the issue at hand, they don't say much about the various trade-offs you need to weigh before choosing a particular approach, so let me try to add a few more details from that perspective.

Partial search is nowadays a very common and important feature, and if implemented improperly it can lead to a poor user experience and bad performance. So first understand the functional and non-functional requirements your application has for this feature, which I talked about in this detailed SO answer.

There are various approaches: query-time (wildcards, regexp), index-time (ngrams), the completion suggester, and the search_as_you_type data type added in recent versions of Elasticsearch.

People who just want to implement a solution quickly can use the end-to-end working solution below.

Index mapping

Index mapping

{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    },
    "index.max_ngram_diff" : 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete", 
        "search_analyzer": "standard" 
      }
    }
  }
}

Index the given sample docs:

{ "title" : "John Doeman" }

{ "title" : "Jane Doewoman" }

{ "title" : "Jimmy Jackal" }

And the search query:

{
    "query": {
        "match": {
            "title": "Doe"
        }
    }
}

which returns the expected search results:

 "hits": [
            {
                "_index": "6467067",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.76718915,
                "_source": {
                    "title": "John Doeman"
                }
            },
            {
                "_index": "6467067",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.76718915,
                "_source": {
                    "title": "Jane Doewoman"
                }
            }
        ]
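
To try this end to end, the steps might look like the following (a sketch; the index name my_idx and the file names are assumptions, with mapping.json holding the index mapping above and query.json the search query; ?refresh makes the documents searchable immediately):

curl -X PUT "localhost:9200/my_idx" -H 'Content-Type: application/json' -d @mapping.json
curl -X POST "localhost:9200/my_idx/_doc/1?refresh" -H 'Content-Type: application/json' -d '{"title": "John Doeman"}'
curl -X POST "localhost:9200/my_idx/_doc/2?refresh" -H 'Content-Type: application/json' -d '{"title": "Jane Doewoman"}'
curl -X POST "localhost:9200/my_idx/_doc/3?refresh" -H 'Content-Type: application/json' -d '{"title": "Jimmy Jackal"}'
curl -X POST "localhost:9200/my_idx/_search" -H 'Content-Type: application/json' -d @query.json
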
无言温柔 2024-11-23 18:34:14

I am using this and got it working:

"query": {
        "query_string" : {
            "query" : "*test*",
            "fields" : ["field1","field2"],
            "analyze_wildcard" : true,
            "allow_leading_wildcard": true
        }
    }
赢得她心 2024-11-23 18:34:14

Try the solution that is described here: Exact Substring Searches in ElasticSearch

{
    "mappings": {
        "my_type": {
            "index_analyzer":"index_ngram",
            "search_analyzer":"search_ngram"
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "index_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "ngram_filter", "lowercase" ]
                },
                "search_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": "lowercase"
                }
            }
        }
    }
}

To solve the disk-usage problem and the too-long-search-term problem, short ngrams of at most 8 characters are used (configured with: "max_gram": 8). To search for terms with more than 8 characters, turn your search into a boolean AND query looking for every distinct 8-character substring in that string. For example, if a user searched for large yard (a 10-character string), the search would be:

"large ya" AND "arge yar" AND "rge yard"

や三分注定 2024-11-23 18:34:14

If you want to implement autocomplete functionality, the Completion Suggester is the neatest solution. This blog post contains a very clear description of how it works.

In short, it is an in-memory data structure called an FST, which contains the valid suggestions and is optimised for fast retrieval and low memory usage. Essentially, it is just a graph. For instance, an FST containing the words hotel, marriot, mercure, munchen and munich would look like this:

[image: the FST graph for the example words]
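
A minimal sketch of the completion suggester on a recent Elasticsearch (index, field, and inputs are assumptions; note that the suggester matches only from the start of the stored inputs, so "Doeman" has to be provided as its own input for the prefix "Doe" to find John Doeman):

curl -X PUT "localhost:9200/my_idx" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "name_suggest": { "type": "completion" }
    }
  }
}'

curl -X POST "localhost:9200/my_idx/_doc/1" -H 'Content-Type: application/json' -d'
{ "name_suggest": { "input": ["John Doeman", "Doeman"] } }'

curl -X POST "localhost:9200/my_idx/_search" -H 'Content-Type: application/json' -d'
{
  "suggest": {
    "name-suggestion": {
      "prefix": "Doe",
      "completion": { "field": "name_suggest" }
    }
  }
}'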

童话 2024-11-23 18:34:14

You can use the regexp query.

{ "_id" : "1", "name" : "John Doeman" , "function" : "Janitor"}
{ "_id" : "2", "name" : "Jane Doewoman","function" : "Teacher"  }
{ "_id" : "3", "name" : "Jimmy Jackal" ,"function" : "Student"  } 

If you use this query:

{
  "query": {
    "regexp": {
      "name": "J.*"
    }
  }
}

you will get all documents whose name starts with "J". If you want to receive just the first two records, whose names end with "man", you can use this query:

{
  "query": { 
    "regexp": {
      "name": ".*man"
    }
  }
}

And if you want to receive all records with an "m" anywhere in their name, you can use this query:

{
  "query": { 
    "regexp": {
      "name": ".*m.*"
    }
  }
}

This works for me, and I hope my answer helps solve your problem. (Note that Elasticsearch regexp queries are anchored to the whole term, which is why .*m.* is needed rather than just m.)

念﹏祤嫣 2024-11-23 18:34:14

Using wildcards (*) prevents the calculation of a score.

东北女汉子 2024-11-23 18:34:14

Nevermind.

I had to look at the Lucene documentation.
Seems I can use wildcards! :-)

curl http://localhost:9200/my_idx/my_type/_search?q=*Doe*

does the trick!
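
For reference, the same idea expressed in the query DSL might look like this (a sketch; wildcard is a term-level query, so with the standard analyzer the lowercase form is what matches, and a leading * can be slow on large indices, as noted in another answer):

{
  "query": {
    "wildcard": { "name": "*doe*" }
  }
}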
