Fast estimation of minimum edit distance
We have a spellchecker implementation based on Levenshtein distance. Since we cannot compute the distance for all possible substitutions (the Levenshtein distance between two strings is computed in O(n^2)), we use a K-gram index to retrieve substitution candidates.
So the K-gram index is just one way of quickly eliminating irrelevant substitutions; I'm interested in other ways as well. At the moment we also use a few more tricks. Given that we are only interested in substitutions within edit distance d of the original string, we can use the following rules:
- the edit distance between two strings cannot be less than the length difference between them, so substitutions whose length differs from the query by more than d can be eliminated;
- a single character change/removal in a string changes at most k of its k-grams, so d edits change at most k * d of them; strings whose k-gram counts differ by more than k * d therefore cannot be within edit distance d.
Are those assumptions correct? And what other ways of eliminating substitutions exist that are applicable to spellcheckers?
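The two pruning rules above can be sketched as a cheap pre-filter. This is an illustrative sketch, not the asker's actual code; the function and parameter names are invented, and rule 2 is implemented as the one-sided k-gram multiset difference bounded by k * d:

```python
from collections import Counter

def kgrams(s, k=2):
    """Multiset of the k-grams of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def may_be_within(query, candidate, d, k=2):
    """Cheap necessary conditions only: True means `candidate` might be
    within edit distance d of `query`; False means it certainly is not."""
    # Rule 1: edit distance >= length difference.
    if abs(len(query) - len(candidate)) > d:
        return False
    # Rule 2: a single edit destroys at most k of a string's k-grams,
    # so d edits destroy at most k * d of them on either side.
    gq, gc = kgrams(query, k), kgrams(candidate, k)
    missing_from_candidate = sum((gq - gc).values())
    missing_from_query = sum((gc - gq).values())
    if max(missing_from_candidate, missing_from_query) > k * d:
        return False
    return True
```

A candidate that survives both checks still needs the full O(n^2) Levenshtein computation; the filter only guarantees that rejected candidates are safe to skip.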
You can use a simple rule to restrict the search to dictionary terms beginning with the same letter as the query string; the hope is that users do not misspell the first letter.
Also, you can use a permuterm index: consider all rotations of the query and traverse the B-tree to find any dictionary terms that match any rotation. You can also refine this rotation scheme by omitting a suffix of l characters before performing the traversal.
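A minimal sketch of the rotation idea, assuming a sorted list of (rotation, term) pairs as a stand-in for the B-tree the answer mentions; all names here are invented for illustration:

```python
import bisect

def rotations(term):
    """All rotations of term with an end marker '$' appended."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(dictionary):
    """Sorted (rotation, original term) pairs, playing the role of the B-tree."""
    entries = []
    for term in dictionary:
        for rot in rotations(term):
            entries.append((rot, term))
    entries.sort()
    return entries

def lookup_by_rotation(index, query, omit=0):
    """Find dictionary terms whose rotations share a prefix with a rotation
    of the query, optionally omitting a suffix of `omit` characters first."""
    matches = set()
    for rot in rotations(query):
        prefix = rot[:len(rot) - omit] if omit else rot
        i = bisect.bisect_left(index, (prefix, ""))
        while i < len(index) and index[i][0].startswith(prefix):
            matches.add(index[i][1])
            i += 1
    return matches
```

Omitting a suffix (`omit > 0`) loosens each prefix match, which is how the rotation scheme tolerates edits near the end of a rotation.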
In my experience the k-gram approximation leaves a lot to be desired (it excludes many relevant results).
Instead, put your terms in an automaton/transducer, a trie, or even a sorted array, and find the true Levenshtein matches via intersection.
It's intuitive if you think about it: if you only want words within a distance of 1 and the input term is "foo", there is no sense in examining "bar", "baz", etc. when visiting the 'b' nodes. Only "boo", "bfoo", etc. stand a chance, so you can restrict the search to prefixes that can possibly lead to a final state.
So you just create an automaton that accepts all words within k edit distances of "foo" and then intersect this automaton with your dictionary automaton/trie/whatever.
You can compute these DFAs extremely efficiently, avoiding any slow NFA-DFA determinization, etc.:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
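The intersection idea can be sketched without the paper's DFA construction by carrying one Levenshtein DP row per trie node and pruning branches whose best possible distance already exceeds the limit; this is the classic dynamic-programming-over-a-trie variant, not the linked paper's algorithm, and the helper names are invented:

```python
def build_trie(words):
    """Nested-dict trie; the key '$' marks end of word."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return trie

def search_trie(trie, word, max_dist):
    """All (dictionary word, distance) pairs within max_dist edits of `word`."""
    results = []
    first_row = list(range(len(word) + 1))  # DP row for the empty prefix

    def recurse(node, prefix, prev_row):
        for ch, child in node.items():
            if ch == "$":
                if prev_row[-1] <= max_dist:  # distance(prefix, word)
                    results.append((prefix, prev_row[-1]))
                continue
            # Extend the DP row by one character of the trie prefix.
            row = [prev_row[0] + 1]
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost)) # substitution
            # Prune: no extension of this prefix can get below min(row).
            if min(row) <= max_dist:
                recurse(child, prefix + ch, row)

    recurse(trie, "", first_row)
    return results
```

The pruning step is exactly the intuition in the answer: once every cell of a node's row exceeds max_dist, no word below that node can possibly match, so the whole subtree is skipped.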