Simple spell-checker algorithm

Published on 2024-12-11 02:53:22


I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)

The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:

  • One letter difference (one letter added or removed to make the word match a word in the dictionary). For example, 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.

  • One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.

So, just to make sure I've explained properly: I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].

Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.

Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.

This seems like it would take up HEAPS of memory though, because there would surely be thousands of permutations for each word? It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words was large?

I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!

So yeah, I'm a bit stumped about which direction to look in. I'd really appreciate any help, as I really am not sure how to estimate the speed of the different approaches (we haven't been taught this at all in class).


Comments (4)

愿与i 2024-12-18 02:53:22


The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].

The problem is that while the removal of a letter creates few "bad words", for the addition or substitution you have many candidates.

So I would suggest another solution ;)

Note: the edit distance you are describing is called the Levenshtein distance.

The solution is described in incremental steps; normally the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.


0. Preliminary

  • Implement the Levenshtein Distance algorithm
  • Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would perform better)

Key points:

  • The Levenshtein distance computation uses an array; at each step, the next row is computed solely from the previous row
  • The minimum distance in a row is always greater than (or equal to) the minimum in the previous row

The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (threshold), then whenever the minimum of the current row exceeds 2, you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
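The short-circuit described here can be sketched in Python as follows (the function name and the cap-return convention are my own choices, not from the answer):

```python
def levenshtein_capped(a: str, b: str, threshold: int) -> int:
    """Row-by-row Levenshtein distance that gives up early: since the row
    minimum never decreases, once it exceeds the threshold the final
    distance must too, so threshold + 1 is returned immediately."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        if min(curr) > threshold:             # short-circuit: no hope left
            return threshold + 1
        prev = curr
    return prev[-1]
```

With threshold 2, comparing cartoon against carwash abandons the computation partway through instead of filling the full matrix.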


1. First Attempt

Let's begin simple.

We'll implement a linear scan: for each word we compute the distance (short-circuited), and we keep the words which have achieved the smallest distance so far.

It works very well on smallish dictionaries.
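A minimal sketch of this linear scan (using a plain, uncapped distance for brevity; in the answer's scheme you would plug in the short-circuited version):

```python
def distance(a, b):
    # Plain row-by-row Levenshtein distance (no short-circuit, for brevity).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def suggest(word, dictionary):
    """Linear scan: keep only the words at the smallest distance seen so far."""
    best, suggestions = None, []
    for candidate in dictionary:
        d = distance(word, candidate)
        if best is None or d < best:
            best, suggestions = d, [candidate]
        elif d == best:
            suggestions.append(candidate)
    return suggestions
```

On the question's own dictionary, suggest("godd", [...]) returns the [good, god] pair from the example.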


2. Improving the data structure

The Levenshtein distance is at least equal to the difference in length.

By using the pair (length, word) as a key instead of just the word, you can restrict your search to the length range [length - edit, length + edit] and greatly reduce the search space.
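A sketch of this length restriction over sorted (length, word) pairs (Python's bisect stands in for the C++ containers the answer has in mind; helper names are mine):

```python
import bisect

def build_index(words):
    # Sorted (length, word) pairs: all words of a given length are contiguous.
    return sorted((len(w), w) for w in words)

def candidates(index, word, max_edits):
    """Restrict the scan to words whose length is within max_edits of
    len(word): the edit distance can never be below the length gap."""
    lo = bisect.bisect_left(index, (len(word) - max_edits, ""))
    hi = bisect.bisect_left(index, (len(word) + max_edits + 1, ""))
    return [w for _, w in index[lo:hi]]
```

Only the surviving candidates then go through the (more expensive) distance computation.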


3. Prefixes and pruning

To improve on this, we can note that when we build the distance matrix, row by row, one word is scanned entirely (the word we look for) but the other (the referent) is not: we only use one letter per row.

This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.

Remember that I asked you to store the dictionary sorted? It means that words sharing the same prefix are adjacent.

Suppose that you are checking your word against cartoon, and at car you realize it does not work (the distance is already too large); then no word beginning with car will work either, and you can skip words as long as they begin with car.

The skip itself can be done either linearly or with a binary search (find the first word with a higher prefix than car):

  • linear works best if the prefix is long (few words to skip)
  • binary search works best for a short prefix (many words to skip)

How long "long" is depends on your dictionary, and you'll have to measure. I would go with the binary search to begin with.
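The binary-search skip can be sketched like this (helper name is mine): to find the first word that no longer starts with the failing prefix, search for the smallest string that sorts after every word with that prefix.

```python
import bisect

def skip_prefix(sorted_words, start, prefix):
    """Index of the first word at or after `start` that does NOT begin
    with `prefix`, assuming sorted_words is sorted and lowercase."""
    # The smallest string sorting after every word that starts with
    # `prefix` is `prefix` with its last character bumped by one.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return bisect.bisect_left(sorted_words, upper, lo=start)
```

For example, when car fails against cartoon, this jumps straight past carbon, cargo, carwash, etc.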

Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space


4. Prefixes and re-use

Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result).

Suppose that you have two words:

  • cartoon
  • carwash

You first compute the matrix, row by row, for cartoon. Then, when reading carwash, you determine the length of the common prefix (here car), and you can keep the first 4 rows of the matrix (corresponding to the empty prefix, c, a, r).

Therefore, when you begin computing carwash, you actually start iterating at w.

To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest referent (you should know the maximum word length in your dictionary).
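Steps 3 and 4 together can be sketched as a scan that keeps one DP row per consumed referent letter and resumes after the shared prefix (a sketch under the answer's assumptions; names and the list-of-rows representation are mine):

```python
def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def search(word, sorted_words, threshold):
    """Scan a sorted dictionary, re-using DP rows shared with the previous
    referent's common prefix. rows[i] is the Levenshtein row after
    consuming i characters of the current referent."""
    rows = [list(range(len(word) + 1))]   # row for the empty prefix
    prev_ref, results = "", []
    for ref in sorted_words:
        keep = common_prefix_len(prev_ref, ref)
        del rows[keep + 1:]               # rows beyond the shared prefix are stale
        for i in range(keep, len(ref)):   # resume the DP at the first new letter
            prev, curr = rows[-1], [i + 1]
            for j, c in enumerate(word, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (c != ref[i])))
            rows.append(curr)
        if rows[-1][-1] <= threshold:     # final cell is the full distance
            results.append(ref)
        prev_ref = ref
    return results
```

The per-word short-circuit from step 0 would slot into the inner loop; it is omitted here to keep the re-use idea visible.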


5. Using a "better" data structure

To have an easier time working with prefixes, you could use a Trie or a Patricia tree to store the dictionary. However, it's not an STL data structure, and you would need to augment it to store in each subtree the range of word lengths stored there, so you'll have to make your own implementation. It's not as easy as it seems, because there are memory-explosion issues which can kill locality.

This is a last resort option. It's costly to implement.

愿与i 2024-12-18 02:53:22


You should have a look at Peter Norvig's explanation of how to write a spelling corrector.

How to write a spelling corrector

Everything is well explained in the article; as an example, the Python code for the spell checker looks like this:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))  # open() replaces Python 2's file()

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
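Tying this back to the question's example: a single edits1 pass already finds both suggestions. A self-contained check (re-stating edits1 from the snippet above, with a tiny dictionary in place of the big.txt frequency model):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings at Levenshtein distance 1 from `word` (as in the article).
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

dictionary = {"hello", "goodbye", "fantastic", "good", "god"}
suggestions = dictionary & edits1("godd")
# suggestions == {"good", "god"}, matching the question's example
```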

Hope you can find what you need on Peter Norvig's website.

沫雨熙 2024-12-18 02:53:22


For a spell checker, many data structures would be useful, for example a BK-Tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.

My earlier code link may be misleading; this one is correct for a spelling corrector.
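For reference, a minimal BK-tree along the lines of the linked article can be sketched like this (my own sketch, not the implementation the answer links to):

```python
def levenshtein(a, b):
    # Plain row-by-row Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """Minimal BK-tree: each child edge is labelled with the distance
    between the child's word and its parent's word."""
    def __init__(self, distance):
        self.distance = distance
        self.root = None                      # (word, {edge_distance: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return                        # word already present
            if d in node[1]:
                node = node[1][d]             # descend along the matching edge
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, max_dist):
        """All words within max_dist of word; by the triangle inequality,
        only edges with |edge - d| <= max_dist can lead to matches."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            ref, children = stack.pop()
            d = self.distance(word, ref)
            if d <= max_dist:
                results.append(ref)
            for edge, child in children.items():
                if abs(edge - d) <= max_dist:
                    stack.append(child)
        return results
```

The pruning works because every word in a subtree hanging off edge k is at distance exactly k from that subtree's parent word, so subtrees whose edge label is far from the query's distance can be discarded wholesale.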

妞丶爷亲个 2024-12-18 02:53:22


Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.

It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.
