I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
- One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example, 'stack' would be a suggestion for 'staick', and 'cool' would be a suggestion for 'coo'.
- One letter substitution. So, for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this working, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, because there would surely be thousands of permutations for each word? It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simpler way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while the removal of a letter creates few "bad words", for the addition or substitution you have many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein Distance
The solution is described in incremental steps; normally the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
First, implement the Levenshtein distance computation, and store your dictionary sorted (a std::set for example, though a sorted std::deque or std::vector would be better performance-wise).

Key points:
- The distance matrix is computed row by row, and each row depends only on the previous one.
- The minimum value in a row is always greater than or equal to the minimum of the previous row.
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (threshold), then whenever the minimum of the current row is greater than 2, you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
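As a sketch of this preliminary step, here is a two-row Levenshtein implementation with the short-circuit described above (the function name `levenshtein` and the threshold convention are just illustrative choices):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance, computed two rows at a time, with a short-circuit:
// if every value in the current row already exceeds `threshold`, the final
// distance cannot come back under it, so we bail out and return threshold + 1.
int levenshtein(const std::string& a, const std::string& b, int threshold) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = (int)j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = (int)i;
        int rowMin = curr[0];
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution
            rowMin = std::min(rowMin, curr[j]);
        }
        if (rowMin > threshold) return threshold + 1;  // short-circuit
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```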
1. First Attempt
Let's begin simple.
We'll implement a linear scan: for each word we compute the distance (short-circuited), and we keep a list of the words that have achieved the smallest distance so far.
It works very well on smallish dictionaries.
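The linear scan might look like this; `suggest` and `editDistance` are hypothetical names, and for brevity this version recomputes the full distance rather than short-circuiting:

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Plain (non-short-circuited) Levenshtein distance, kept minimal so the
// scan below is self-contained.
int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = (int)j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = (int)i;
        for (std::size_t j = 1; j <= b.size(); ++j)
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)});
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Linear scan: compute the distance to every dictionary word and keep
// the words achieving the smallest distance seen so far.
std::vector<std::string> suggest(const std::set<std::string>& dict,
                                 const std::string& word) {
    std::vector<std::string> best;
    int bestDist = -1;
    for (const std::string& candidate : dict) {
        int d = editDistance(candidate, word);
        if (best.empty() || d < bestDist) {
            bestDist = d;
            best.assign(1, candidate);
        } else if (d == bestDist) {
            best.push_back(candidate);
        }
    }
    return best;
}
```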
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
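One way to exploit this bound: key the dictionary on (length, word) pairs and enumerate only the words whose length is within the edit budget, since anything further away is already too many edits distant. A sketch (`candidates` and `LenDict` are illustrative names):

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Dictionary keyed on (length, word) pairs; the sorted order groups
// words by length first.
using LenDict = std::set<std::pair<std::size_t, std::string>>;

// Enumerate only the words whose length lies within maxEdit of the
// query's length: the distance to any other word is at least
// maxEdit + 1, because it is at least the length difference.
std::vector<std::string> candidates(const LenDict& dict,
                                    const std::string& word,
                                    std::size_t maxEdit) {
    std::size_t lo = word.size() > maxEdit ? word.size() - maxEdit : 0;
    std::size_t hi = word.size() + maxEdit;
    std::vector<std::string> out;
    // lower_bound jumps straight to the first word of length lo.
    for (auto it = dict.lower_bound({lo, std::string()});
         it != dict.end() && it->first <= hi; ++it)
        out.push_back(it->second);
    return out;
}
```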
By using the pair (length, word) as a key instead of just the word, you can restrict your search to the range of lengths [length - edit, length + edit] and greatly reduce the search space.

3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix, row by row, one word is entirely scanned (the word we look for) but the other (the referent) is not: we only use one of its letters per row.

This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.

Remember that I asked you to store the dictionary sorted? It means that words sharing the same prefix are adjacent.
Suppose that you are checking your word against cartoon, and at car you realize it does not work (the distance is already too high); then any word beginning with car won't work either, and you can skip words for as long as they begin with car.

The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car). How long is "long" depends on your dictionary and you'll have to measure; I would go with the binary search to begin with.
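The binary-search skip can lean on the dictionary's sorted order: the first word not sharing a doomed prefix is found with one lower_bound call. A sketch, assuming a non-empty lowercase ASCII prefix (`skipPrefix` is an illustrative name):

```cpp
#include <set>
#include <string>

// Given the sorted dictionary and a prefix that has been ruled out,
// return an iterator to the first word NOT starting with that prefix,
// skipping the whole doomed range in O(log n). Assumes a non-empty,
// lowercase ASCII prefix, so incrementing the last character yields the
// first string sorting after every word with this prefix.
std::set<std::string>::const_iterator
skipPrefix(const std::set<std::string>& dict, const std::string& prefix) {
    std::string upper = prefix;
    ++upper.back();  // e.g. "car" -> "cas"
    return dict.lower_bound(upper);
}
```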
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space.
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words: cartoon and carwash.
You first compute the matrix, row by row, for cartoon. Then, when reading carwash, you need to determine the length of the common prefix (here car), and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).

Therefore, when you begin computing carwash, you in fact start iterating at w.
To do this, simply use an array allocated right at the beginning of your search, and make it large enough to accommodate the largest reference (you should know the largest word length in your dictionary).
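A sketch of this row reuse, using a vector of saved rows rather than the single preallocated array suggested above, just to keep it short (`IncrementalMatcher` is an illustrative name; it assumes references arrive in sorted order so that consecutive calls share long prefixes):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Incremental Levenshtein against a sorted dictionary: rows[i] is the
// matrix row after consuming i letters of the current reference word.
// When moving to the next reference, only the rows past the common
// prefix are recomputed.
struct IncrementalMatcher {
    std::string word;                    // the (possibly misspelled) query
    std::vector<std::vector<int>> rows;  // rows[0] = 0, 1, ..., word.size()
    std::string prevRef;

    explicit IncrementalMatcher(const std::string& w) : word(w) {
        rows.emplace_back();
        for (std::size_t j = 0; j <= w.size(); ++j) rows[0].push_back((int)j);
    }

    // Distance from `word` to `ref`; call with references in sorted
    // order so that consecutive calls share long prefixes.
    int distanceTo(const std::string& ref) {
        std::size_t keep = 0;  // length of the prefix shared with prevRef
        while (keep < ref.size() && keep < prevRef.size() &&
               ref[keep] == prevRef[keep]) ++keep;
        rows.resize(keep + 1);  // rows 0..keep are still valid
        for (std::size_t i = keep + 1; i <= ref.size(); ++i) {
            std::vector<int> row(word.size() + 1);
            row[0] = (int)i;
            const std::vector<int>& prev = rows[i - 1];
            for (std::size_t j = 1; j <= word.size(); ++j)
                row[j] = std::min({prev[j] + 1, row[j - 1] + 1,
                                   prev[j - 1] + (ref[i - 1] == word[j - 1] ? 0 : 1)});
            rows.push_back(std::move(row));
        }
        prevRef = ref;
        return rows[ref.size()].back();
    }
};
```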
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia tree to store the dictionary. However, neither is an STL data structure, and you would need to augment it so that each subtree stores the range of word lengths it contains, so you'll have to make your own implementation. It's not as easy as it seems, because there are memory-explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
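For completeness, a minimal sketch of the augmentation mentioned here: a trie node carrying the range of word lengths stored in its subtree, so length pruning and prefix sharing can be combined (`TrieNode` and `trieInsert` are illustrative names; a real Patricia tree would also compress single-child chains):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Trie node augmented with the [minLen, maxLen] range of word lengths
// stored in its subtree, so length-based pruning can be applied while
// walking prefixes.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false;
    std::size_t minLen = SIZE_MAX, maxLen = 0;  // word-length range below this node
};

void trieInsert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        node->minLen = std::min(node->minLen, word.size());
        node->maxLen = std::max(node->maxLen, word.size());
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->minLen = std::min(node->minLen, word.size());
    node->maxLen = std::max(node->maxLen, word.size());
    node->isWord = true;
}
```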
You should have a look at Peter Norvig's explanation of how to write a spelling corrector.
How to write a spelling corrector
Everything is well explained in this article, which walks through example Python code for the spell checker. Hope you can find what you need on Peter Norvig's website.
For a spell checker, many data structures would be useful, for example a BK-Tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.

My earlier code link may be misleading; this one is correct for a spelling corrector.
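A hedged sketch of the BK-tree idea from the linked article: each child hangs off its parent under an edge labeled with dist(child, parent), and a query with tolerance t only descends edges whose label is within t of the query's distance to the node, by the triangle inequality (names are illustrative):

```cpp
#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Plain Levenshtein distance, used as the BK-tree metric.
int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = (int)j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = (int)i;
        for (std::size_t j = 1; j <= b.size(); ++j)
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)});
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

struct BKNode {
    std::string word;
    std::map<int, std::unique_ptr<BKNode>> children;  // keyed by distance to this node
};

struct BKTree {
    std::unique_ptr<BKNode> root;

    void insert(const std::string& word) {
        if (!root) { root = std::make_unique<BKNode>(); root->word = word; return; }
        BKNode* node = root.get();
        for (;;) {
            int d = editDistance(word, node->word);
            if (d == 0) return;  // word already present
            auto& child = node->children[d];
            if (!child) { child = std::make_unique<BKNode>(); child->word = word; return; }
            node = child.get();
        }
    }

    // Collect every stored word within `tol` edits of `word`.
    std::vector<std::string> find(const std::string& word, int tol) const {
        std::vector<std::string> out;
        search(root.get(), word, tol, out);
        return out;
    }

private:
    static void search(const BKNode* node, const std::string& word, int tol,
                       std::vector<std::string>& out) {
        if (!node) return;
        int d = editDistance(word, node->word);
        if (d <= tol) out.push_back(node->word);
        // Triangle inequality: only edges labeled within [d - tol, d + tol]
        // can lead to words within tol of the query.
        for (const auto& [label, child] : node->children)
            if (label >= d - tol && label <= d + tol)
                search(child.get(), word, tol, out);
    }
};
```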
Off the top of my head, you could split up your suggestions based on length and build a tree structure where the children are longer variations of the shorter parent.

It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.