非常快速地从数据库获取模糊字符串匹配

发布于 2024-08-18 15:35:46 字数 246 浏览 7 评论 0原文

我有一个大约 150'000 个单词和一个模式(任何单个单词)的数据库,我想从数据库中获取所有单词,该单词与模式之间的 Damerau-Levenshtein 距离小于给定数字。我需要非常快做到这一点。你能建议什么算法?如果 Damerau-Levenshtein 距离没有好的算法,那么 Levenshtin 距离也可以。

感谢您的帮助。

PS 我不会使用 SOUNDEX。

I have a database of ~150'000 words and a pattern (any single word) and I want to get all words from the database which has Damerau-Levenshtein distance between it and the pattern less than given number. I need to do it extremely fast. What algorithm could you suggest? If there's no good algorithm for Damerau-Levenshtein distance, just Levenshtin distance will be welcome as well.

Thank you for your help.

P.S. I'm not going to use SOUNDEX.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

梦萦几度 2024-08-25 15:35:46

我会从一个 SQL 函数开始计算 Levenshtein 距离(在 T-SQl 或 .Net 中)(是的,我是 MS 人员...),其中最大距离参数会导致提前退出。

然后,可以使用此函数将您的输入与每个字符串进行比较,以检查距离,如果超过阈值则继续进行下一个。

例如,我还认为您可以将最大距离设置为 2,然后过滤长度大于 1 且第一个字母不同的所有单词。使用索引,这可能会稍微快一些。

您还可以快捷方式带回所有完美匹配的字符串(索引将加快速度),因为这些字符串实际上需要更长的时间来计算 0 的编辑距离。

只是一些想法......

I would start with a SQL function to calculate the Levenshtein distance (in T-SQl or .Net) (yes, I'm a MS person...) with a maximum distance parameter that would cause an early exit.

This function could then be used to compare your input with each string to check the distanve and move on to the next if it breaks the threshold.

I was also thinking you could, for example, set the maximum distance to be 2, then filter all words where the length is more than 1 different whilst the first letter is different. With an index this may be slightly quicker.

You could also shortcut to bring back all strings that are perfect matches (indexing will speed this up) as these will actually take longer to calculate the Levenshtein distance of 0.

Just some thoughts....

迷迭香的记忆 2024-08-25 15:35:46

我认为您无法在不实际枚举所有行的情况下计算此类函数。
所以解决方案是:

  1. 使其成为一个非常快速的枚举(但这并不能真正扩展)
  2. 以某种方式过滤初始变体(按字母索引,至少 x 个常见字母)
  3. 使用替代(可索引)算法,例如 N-Grams(但是我没有关于 ngrams 与 DL 距离的结果质量的详细信息)。

I do not think you can calculate this kind of function without actually enumerating all rows.
So the solutions are:

  1. Make it a very fast enumeration (but this doesn't really scale)
  2. Filter initial variants somehow (index by a letter, at least x common letters)
  3. Use alternative (indexable) algorithm, such as N-Grams (however I do not have details on result quality of ngrams versus D-L distance).
还不是爱你 2024-08-25 15:35:46

我想到的一个解决方案可能是将数据库存储在排序集中(例如,C++ 中的 std::set),因为在我看来,按字典顺序排序的字符串会比较好。要估计给定字符串在 set 中的位置,请在字符串上使用 std::upper_bound,然后从找到的位置在两个方向上向外迭代该集合,计算移动的距离,并在低于某个阈值时停止。我有一种感觉,这个解决方案可能只会匹配具有相同起始字符的字符串,但如果您使用该算法进行拼写检查,那么这种限制是常见的,或者至少不足为奇。

编辑:但是,如果您正在寻找算法本身的优化,那么这个答案是无关紧要的。

A solution off the top of my head might be to store the database in a sorted set (e.g., std::set in C++), as it seems to me that strings sorted lexicographically would compare well. To approximate the position of the given string in the set, use std::upper_bound on the string, then iterate over the set outward from the found position in both directions, computing the distance as you go, and stop when it falls below a certain threshold. I have a feeling that this solution would probably only match strings with the same start character, but if you're using the algorithm for spell-checking, then that restriction is common, or at least unsurprising.

Edit: If you're looking for an optimisation of the algorithm itself, however, this answer is irrelevant.

怂人 2024-08-25 15:35:46

我使用 KNIME 进行字符串模糊匹配,并且很快得到了结果。在其中制作可视化工作流程也非常容易。只需从 https://www.knime.org/ 安装 KNIME 免费版,然后使用“String Distance”并“相似性搜索”节点来获取结果。我在这里附加了一个小的模糊匹配样本工作流程(在这种情况下,输入数据来自顶部,要搜索的模式来自底部):
在此处输入图像描述

I have used KNIME for string fuzzy matching and has got very fast results. It is also very easy to make visual workflows in it. Just install KNIME free edition from https://www.knime.org/ then use "String Distance" and "Similarity Search" nodes to get your results. I have attached a small fuzzy matching smaple workflow in here (the input data come from top and the patterns to search for come from the bottom in this case):
enter image description here

双手揣兜 2024-08-25 15:35:46

我建议查看 Ankiro

我不确定它是否满足您对精度的要求,但它很快。

I would recommend looking into Ankiro.

I'm not certain that it meets your requirements for precision, but it is fast.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文