遍历 Trie 来检查拼写建议的好算法是什么?
假设建立了一个通用的字典单词Trie,那么在遍历过程中检查替换、删除、转置和插入这四种拼写错误的最佳方法是什么?
一种方法是找出给定单词的 n 个编辑距离内的所有单词,然后在 Trie 中检查它们。这不是一个坏选择,但这里更好的直觉似乎是使用动态编程(或递归等效)方法来确定在遍历期间修改单词后的最佳子尝试。
任何想法都会受到欢迎!
PS,希望得到实际的输入,而不仅仅是答案中的链接。
Assuming that a general Trie of dictionary words is built, what would be the best method to check for the 4 cases of spelling mistakes - substitution, deletion, transposition and insertion during traversal?
One method is to figure out all the words within n edit distances of a given word and then checking for them in the Trie. This isn't a bad option, but a better intuition here seems to be use a dynamic programming (or a recursive equivalent) method to determine the best sub-tries after having modified the words during traversal.
Any ideas would be welcome!
PS, would appreciate actual inputs rather than just links in answers.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
前几天我实际上编写了一些代码来执行此操作:
https:// bitbucket.org/teoryn/spell-checker/src/tip/spell_checker.py
它基于 Peter Norvig 的代码 (http://norvig.com/spell- Correct.html),但将字典存储在特里树中,以便更快地查找给定编辑距离内的单词。
该算法通过消耗输入单词中的字母,递归地遍历特里树,在每一步中应用可能的编辑(或不应用)。递归调用的参数表明还可以进行多少次编辑。 trie 通过检查从我们给定的前缀实际上可以到达哪些字母来帮助缩小搜索空间。例如,当插入一个字符时,我们不是添加字母表中的每个字母,而是只添加从当前节点可达的字母。不进行编辑相当于从 trie 中的当前节点沿着输入单词的当前字母获取分支。如果该分支不存在,那么我们可以回溯并避免搜索可能找不到真正单词的大空间。
I actually wrote some code to do this the other day:
https://bitbucket.org/teoryn/spell-checker/src/tip/spell_checker.py
It's based on the code by Peter Norvig (http://norvig.com/spell-correct.html) but stores the dictionary in a trie for finding words within a given edit distance faster.
The algorithm walks the trie recursively applying the possible edits (or not) at each step along the way by consuming letters from the input word. A parameter to the recursive call states how many more edits can be made. The trie helps narrow the search space by checking which letters can actually be reached from our given prefix. For example, when inserting a character, instead of adding each letter in the alphabet, we only add letters that are reachable from the current node. Not making an edit is equivalent to taking the branch from the current node in the trie along the current letter from the input word. If that branch is not there then we can backtrack and avoid searching a possibly large space where no real words could be found.
我认为您可以通过在树上进行简单的广度优先搜索来做到这一点:选择您要查找的错误数量的阈值,只需一次遍历要匹配的单词的字母,生成一组到目前为止已达到与前缀匹配的 (prefix, subtrie) 对,当您低于错误阈值时,请添加到您的下一个子目标集中:
这看起来很天真:是否有一个问题让您想到动态规划?
I think you can do this with a straightforward breadth-first search on the tree: choose a threshold of the number of errors you are looking for, simply run through the letters of the word to be matched one at a time, generating a set of (prefix, subtrie) pairs reached so far matching the prefix, and while you are beneath your error threshold, add to your set of next subgoals:
This seems pretty naive: is there a problem with this that led you to think of dynamic programming?
假设单词中的每个连续字符代表树中的一个级别,那么您需要检查每个字符的五种情况(匹配、删除、插入、替换和转置)。我假设换位是两个相邻的字符。
您将需要一个函数(CheckNode)来接受树节点和要检查的字符。它将需要返回一组表示匹配的(子/孙)节点。
您将需要一个接受单词的函数(CheckWord)。它根据一组节点依次检查每个字符。它将返回一组表示匹配单词的(叶)节点。
这个想法是,树中的每个级别(子级、孙级等)都与单词中字符的位置相匹配。如果您将顶级树节点称为级别 0,那么您将拥有级别 1、级别 2 等。
显然,对于没有错误的单词,字符位置和树中的级别之间存在一对一的匹配。
对于删除,您需要跳过树中的一级。
对于插入,您需要跳过单词中的一个字符。
对于替换,您需要跳过一个级别和一个角色。
对于转置,您需要(暂时)交换单词中的字符。
Presuming each successive character in your word represents one level in your tree, then you have five cases to check at each character (a match, deletion, insertion, substitution and transposition). I'm assuming transpositions are two adjacent characters.
You will need a function (CheckNode) that accepts a tree node and a character to check. It will need to return a set of (child/grand-child) nodes representing matches.
You will need a function (CheckWord) that accepts a word. It checks each character in turn against a set of nodes. It will return a set of (leaf) nodes representing matched words.
The idea is that each level in the tree (child, grand-child etc.), matches the position of the character in the word. If you call the top level tree node, level 0, then you'll have level 1, level 2 etc.
Clearly for a word without errors, there is a one to one match between the character position and the level in the tree.
For deletions, you need to skip a level in the tree.
For insertions, you need to skip a character in the word.
For substitutions, you need to skip both a level and a character.
For transpositions, you need to (temporarily) swap the characters in the word.
看一下计算 Levenshtein 距离,它提供了一种动态编程解决方案,用于查找两个之间的距离序列。
Take a look at calculating the Levenshtein distance which provides a dynamic programming solution for finding the distance between two sequences.