.NET 中检测简单拼写错误的算法
是否有现有的 .NET 算法能够从预定义单词列表中检测拼写错误?
例如,假设单词“Stuff”在我的列表中,并且有人输入“Stuf”、“sutff”或“stff”或“stiff”。我希望能够向该人建议“东西”这个词可能是正确的词。
我不是在谈论任何语法问题,也不是在谈论任何可能有一个字母丢失、替换或混合的事情。
我的目标是防止在不同类型的列表中输入相同的单词。并不是说大写和小写不会对我造成问题,因为所有内容始终都是小写。
Is there an existing algorithm for .NET that would be able to detects typos from a list of predefined words?
Example, suppose the word "Stuff" is in my list and someone types "Stuf", "sutff", or "stff" or "stiff". I would like to be able to suggest the person that the word "Stuff" might be the right word.
I am not talking about anything grammatical or anything more than maybe one letter missing, substituted or mixed.
My goal is to prevent the same word being entered in a list with different typing. Not that uppercase and lowercase do not cause a problem for me as everything is always lowercase.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
这是一个经过充分研究的问题,并且有许多很好的算法可以做到这一点。它们中的大多数通过构建某种数据结构来保存所有合法单词,其中具有相似的单词 可以高效地找到编辑距离。通俗地说,两个字符串之间的编辑距离是将一个字符串转换为另一个字符串所需的更改次数。例如,给定单词“misspell”和“mispell”,编辑距离为 1(只需在单词中插入另一个 's'),而“cat”和“dog”之间的编辑距离为 3(替换每个字母) 。
拼写错误的单词可能与预期的单词只有很小的编辑距离,因此,如果您可以以某种方式存储单词,对于任何字符串,您都可以查询与该字符串的编辑距离很小的单词,那么您可以可以提供用户可能表示的可能单词的候选列表。
保存此数据的一种常见数据结构是 trie,这是一种 26 路树结构,其中每个节点存储单词的前缀,每条边对应于在当前前缀上添加一些字母。如果您有这样的结构,您可以使用简单的递归树搜索来查找与特定单词“接近”(可能相距一定编辑距离)的单词。在每个点,记录您希望距目标单词多远的编辑距离以及到目前为止您已处理的拼写错误单词的数量。在每一点,您可以沿着特里树中与单词中的字母相对应的边,也可以通过沿着特里树中的不同边,用尽一个编辑距离来插入新字母。
另一种常用的数据结构是 BK-tree,它将字符串存储在一种可以有效查询距某个源字符串一定编辑距离的所有单词的方法。这会更直接地解决你的问题,尽管与尝试相比,关于如何构建 BK 树的在线资源较少。
一旦您找到了某个编辑距离内的单词,您可能希望在将它们呈现给用户之前以某种方式对它们进行排名。这要求您了解人们在实践中容易犯什么样的打字错误。常见的拼写错误包括
为了构建一个好的拼写检查器,理想情况下您应该拥有与每种类型的错误相关的某种概率。这样,当您获得可能的更正列表时,您可以将它们从最可能到最不可能进行排名。
希望这有帮助!
This is a well-studied problem and there are many good algorithms for doing this. Most of them work by constructing some sort of data structure to hold all of the legal words in a way where words with similar edit distance can be found efficiently. Informally, the edit distance between two strings is the number of changes you would need to make to transform one string into another. For example, given the words "misspell" and "mispell," the edit distance is one (just insert another 's' into the word), while the edit distance between "cat" and "dog" is three (replace each letter).
A misspelled word is likely only a small edit distance away from the word that was intended, so if you can store the words in a way where you can, for any string, query words that are a small edit distance away from the string, you can provide a candidate list of possible words that the user may have meant.
One common data structure for holding this data is a trie, a 26-way tree structure where each node stores a prefix of a word and each edge corresponds to appending some letter to the current prefix. If you have such a structure, you can find words that are "close" to a particular word (perhaps a certain edit distance away) using a simple recursive tree search. At each point, keep track of how far of an edit distance you wish to be away from the target word and how much of the misspelled word that you have processed so far. At each point, you can either follow the edge in the trie corresponding to the letter in the word, or you can use up one edit distance to insert a new letter by following a different edge in the trie.
Another data structure that is often used for this is a BK-tree, which stores strings in a way where you can efficiently query all words a certain edit distance away from some source string. This would more directly solve your problem, though there are fewer online resources on how to construct BK-trees compared with tries.
Once you've found the words that are within some edit distance, you'll probably want to rank them somehow before presenting them to the user. This requires you to have some idea of what sorts of typos people tend to make in practice. Common typos include
To build a good spell checker, you would ideally have some sort of probabilities associated with each type of mistake. That way, when you had your list of possible corrections, you could rank them from most probable to least probable.
Hope this helps!
这里有一个很好的逐步实现您正在寻找的在 python 中实现的内容,但也有指向 C# 和其他语言实现的链接。
http://norvig.com/spell- Correct.html
Here's a good step by step to do what you are looking for implemented in python, but there are links to implementations in C# and other languages too.
http://norvig.com/spell-correct.html
也许您想寻找三元组搜索?三元组搜索需要创建输入的 3 个字母的所有可能性,并在匹配中查找相似的字符串。
Maybe you want to look for a trigram search? A trigram search need to create every possibilites of 3 letters of your input and look for similar strings in the match.
发布我在 C# 中的实现,允许长度 >= 2 的字符串。
它检查 2 个单词是否匹配,忽略换位、替换、删除(对于删除后长度 >=2 的单词有效)和 重复(允许多次)。
Posting my implementation in C#, allowed for strings with length >= 2.
It checks if 2 words match disregarding Transpositions, Substitutions, Deletions (valid for words with length >=2 after deletion) and Repetitions (multiple allowed).