Is Lucene fuzzy search lazy?

Posted on 2024-09-06 19:52:15

I would like to use Lucene's fuzzy search, which I understand is based on some sort of Levenshtein-like algorithm. If I use a fairly high threshold (e.g., "new york~0.9"), will it first compute the full edit distance and then check whether it is within whatever 0.9 corresponds to, or will it cut the algorithm off as soon as it becomes apparent that the document does not match the query that closely? I understand that this is possible with the Levenshtein algorithm.

Comments (1)

预谋 2024-09-13 19:52:15

"will it cut off the algorithm if it becomes apparent that the document does not match the query that closely?"

No. The code you want to see is lines 57-59 of FuzzyTermEnum:

int dist = editDistance(text, target, textlen, targetlen);
distance = 1 - ((double)dist / (double)Math.min(textlen, targetlen));
return (distance > FUZZY_THRESHOLD);

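You can see that it always computes the full edit distance first, converts it into a similarity score (the variable named distance, despite its name), and only then compares that score against the threshold: the term matches when the score is greater than FUZZY_THRESHOLD.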
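
To make the arithmetic concrete, here is a minimal, self-contained sketch of the same check. It is not Lucene's code: the class name, the method names, and the plain dynamic-programming editDistance are assumptions standing in for the real FuzzyTermEnum internals. Only the normalization and the comparison follow the three lines quoted above.

```java
public class FuzzySimilaritySketch {

    // Plain Levenshtein distance using two rolling rows.
    // Assumption: a stand-in for Lucene's own editDistance method.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Same normalization as the quoted snippet: 1 - dist / min(length).
    public static double similarity(String text, String target) {
        int dist = editDistance(text, target);
        return 1 - ((double) dist / (double) Math.min(text.length(), target.length()));
    }

    // The term matches when the similarity is strictly above the threshold.
    public static boolean matches(String text, String target, double threshold) {
        return similarity(text, target) > threshold;
    }

    public static void main(String[] args) {
        // Two substitutions over a minimum length of 8: 1 - 2/8
        System.out.println(similarity("new york", "new yrok"));   // 0.75
        System.out.println(matches("new york", "new yrok", 0.9)); // false
    }
}
```

Note that under this formula a threshold of 0.9 on an 8-character term only admits an edit distance of 0, since a single edit already lowers the score to 0.875.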

Why do you care about this though? Unless your terms are thousands of characters long, calculating the full edit distance will be really quick.
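
For comparison, the cut-off the question describes could be sketched as follows. This is an illustration under assumptions, not something the quoted FuzzyTermEnum code does: boundedEditDistance and its maxEdits parameter are hypothetical names. It relies on the fact that the final distance can never be smaller than the smallest value in any row of the dynamic-programming table, so the loop can stop once a whole row exceeds the budget.

```java
public class BoundedEditDistanceSketch {

    // Returns the Levenshtein distance between a and b,
    // or -1 as soon as it is certain to exceed maxEdits.
    // Hypothetical helper; not part of Lucene's API.
    public static int boundedEditDistance(String a, String b, int maxEdits) {
        // The length difference is a lower bound on the distance.
        if (Math.abs(a.length() - b.length()) > maxEdits) {
            return -1;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Every alignment passes through this row, so the final
            // distance is at least rowMin: stop early if over budget.
            if (rowMin > maxEdits) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int dist = prev[b.length()];
        return dist <= maxEdits ? dist : -1;
    }

    public static void main(String[] args) {
        System.out.println(boundedEditDistance("new york", "new yrok", 2)); // 2
        System.out.println(boundedEditDistance("abc", "xyz", 1));           // -1
    }
}
```

For short terms the saving is small, which is consistent with the point above that computing the full distance is already cheap.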
