Elasticsearch fuzzy matching: max_expansions & min_similarity

Posted on 2024-11-30 16:39:09

I'm using fuzzy matching in my project mainly to find misspellings and different spellings of the same names. I need to understand exactly how fuzzy matching in Elasticsearch works and how it uses the two parameters mentioned in the title.

As I understand it, min_similarity is the percentage by which the queried string must match the string in the database. I couldn't find an exact description of how this value is calculated.

As I understand it, max_expansions is the Levenshtein distance within which the search is executed. If it actually were the Levenshtein distance, it would be the ideal solution for me. However, it does not behave that way. For example, I have the word "Samvel":

queryStr      max_expansions           matches?
samvel        0                        Error: should not be 0 (but Levenshtein distance can be 0!)
samvel        1                        Yes
samvvel       1                        Yes
samvvell      1                        Yes (but it shouldn't have)
samvelll      1                        Yes (but it shouldn't have)
saamvelll     1                        No (but for some weird reason it matches Samvelian)
saamvelll     anything bigger than 1   No

The documentation says something I actually do not understand:

Add max_expansions to the fuzzy query allowing to control the maximum number 
of terms to match. Default to unbounded (or bounded by the max clause count in 
boolean query).

Can anyone please explain exactly how these parameters affect the search results?

淡墨 2024-12-07 16:39:09

The min_similarity is a value between zero and one. From the Lucene docs:

For example, for a minimumSimilarity of 0.5 a term of the same length 
as the query term is considered similar to the query term if the edit 
distance between both terms is less than length(term)*0.5

The 'edit distance' that is referred to is the Levenshtein distance.
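To make the quoted rule concrete, here is a minimal Python sketch. It is not Lucene's code: `levenshtein` and `is_similar` are hypothetical helper names, and `is_similar` applies the quoted sentence literally. The quote only covers terms of the same length as the query term, so applying the same inequality to terms of other lengths is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def is_similar(query: str, term: str, min_similarity: float) -> bool:
    """Literal reading of the quoted rule (an assumption, not Lucene's code):
    the term matches if the edit distance is less than length(term) * min_similarity."""
    return levenshtein(query, term) < len(term) * min_similarity


if __name__ == "__main__":
    for term in ["samvel", "samuel", "uvwxyz"]:
        print(term, levenshtein("samvel", term), is_similar("samvel", term, 0.5))
```

With min_similarity set to 0.5 and six-letter terms, the threshold is 3, so under this reading up to two edits are tolerated.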

The way this query works internally is:

1. It finds all terms in the index that could match the search term, taking min_similarity into account.
2. It then searches for all of those terms.

You can imagine how heavy this query could be!

To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
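The following toy model in Python illustrates the point that max_expansions caps the number of expanded terms rather than the edit distance. It is only a sketch under assumptions: `expand` and the sample term list are hypothetical, the similarity test is the literal reading of the Lucene quote above, and keeping the closest terms first is an assumed ordering; the real implementation may choose which terms to keep differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein (edit) distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def expand(query, index_terms, min_similarity, max_expansions=None):
    """Toy version of the expansion step: collect index terms that pass
    the similarity test, then keep at most max_expansions of them."""
    candidates = []
    for term in index_terms:
        distance = levenshtein(query, term)
        # Assumed rule: literal reading of the quoted Lucene sentence.
        if distance < len(term) * min_similarity:
            candidates.append((distance, term))
    candidates.sort()  # assumed ordering: closest first, ties alphabetical
    terms = [term for _, term in candidates]
    return terms if max_expansions is None else terms[:max_expansions]


if __name__ == "__main__":
    # Hypothetical terms standing in for what the index contains.
    index_terms = ["samvel", "samvelian", "samuel", "daniel"]
    print(expand("samvel", index_terms, 0.5))
    print(expand("samvel", index_terms, 0.5, max_expansions=1))
```

In this model, lowering max_expansions never changes how far a term may be from the query; it only shortens the list of terms that the second step searches for.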
