Spell checker with an algorithm for correcting merged-word misspellings

Posted 2025-01-05 15:30:36

Recently I've looked through several spell checker algorithms, including simple ones (like Peter Norvig's) and much more complex ones (like Brill and Moore's). But there's a type of error that none of them can handle. If, for example, I type stackoverflow instead of stack overflow, these spell checkers will fail to correct the mistype (unless stack overflow is in the dictionary of terms). Storing all pairs of words is too expensive (and it would not help if the error is three single words run together without spaces).
Is there an algorithm that can correct this type of error, in addition to the usual mistypes?

Some examples of what I need:
spel checker -> spell checker
spellchecker -> spell checker
spelcheker -> spell checker

Comments (3)

夜司空 2025-01-12 15:30:37

I sometimes get such suggestions when spell-checking in Kate, so there certainly is an algorithm that can correct some such errors. I am sure one can do better, but one idea is to split the candidate at likely places and check whether close matches exist for the components. The hard part is deciding where the likely places are. In the languages I'm somewhat familiar with, there are letter combinations that occur rarely within words. For example, the combinations dk or lh are, as far as I'm aware, rare in English words. Other combinations often occur at the start of words (e.g. un, ch), so those would be good guesses for splitting too. In the example spelcheker, the lc combination is not too widespread, and ch is a common start of words, so the split spel and cheker is a prime candidate, and any decent algorithm would then find spell and checker (but it would probably also find spiel, so don't auto-correct, just give suggestions).
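
A minimal sketch of that idea (not what Kate actually does): WORDS, BIGRAM_FREQ, the 0.001 threshold, and the helper names below are toy stand-ins for a real dictionary and real letter-bigram statistics, and close_matches() only handles single-letter insertions where a full checker would plug in its edit-distance machinery.

WORDS = {'spell', 'checker', 'spiel'}
BIGRAM_FREQ = {'el': 0.01, 'ch': 0.02, 'lc': 0.0001}  # toy letter-bigram frequencies
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def close_matches(part):
  # Exact dictionary hits plus single-letter-insertion repairs,
  # e.g. 'spel' -> 'spell' or 'spiel', 'cheker' -> 'checker'.
  hits = {part} & WORDS
  hits |= {part[:i] + c + part[i:]
           for i in range(len(part) + 1) for c in alphabet} & WORDS
  return hits

def likely_boundary(a, b, threshold=0.001):
  # An unseen or very infrequent letter pair suggests a missing space.
  return BIGRAM_FREQ.get(a + b, 0.0) < threshold

def suggest_splits(word):
  # Split only where the adjoining letters rarely co-occur, then require
  # both halves to be (close to) dictionary words.
  suggestions = []
  for i in range(1, len(word)):
    if likely_boundary(word[i - 1], word[i]):
      for left in close_matches(word[:i]):
        for right in close_matches(word[i:]):
          suggestions.append(left + ' ' + right)
  return suggestions

print(sorted(suggest_splits('spelcheker')))  # ['spell checker', 'spiel checker'] -- suggest, don't auto-correct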

小红帽 2025-01-12 15:30:36

I hacked up Norvig's spell corrector to do this. I had to cheat a bit and add the word 'checker' to Norvig's data file because it never appears. Without that cheating, the problem is really hard.

expertsexchange expert exchange
spel checker spell checker
spellchecker spell checker
spelchecker she checker  # can't win them all
baseball base all  # baseball isn't in the dictionary either :(
hewent he went

Basically you need to change the code so that:

  • you add a space to the alphabet so that word breaks are explored automatically.
  • to consider a phrase valid, you check that all of the words that make up the phrase are in the dictionary, rather than checking dictionary membership of the whole string directly (the dictionary contains no phrases).
  • you need a way to score a phrase against plain words.

The latter is the trickiest, and I use a braindead independence assumption for phrase composition: the probability of two adjacent words is the product of their individual probabilities (done here as a sum in log-probability space), with a small penalty. I am sure that in practice you'll want to keep some bigram stats to do that splitting well (a rough sketch of that follows the listing below).

import re, collections, math

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
  # Word counts with add-one smoothing, stored as log probabilities;
  # unseen words fall back to a small default log probability.
  counts = collections.defaultdict(lambda: 1.0)
  for f in features:
    counts[f] += 1.0
  tot = float(sum(counts.values()))
  model = collections.defaultdict(lambda: math.log(.1 / tot))
  for f in counts:
    model[f] = math.log(counts[f] / tot)
  return model

NWORDS = train(words(open('big.txt').read()))

# The trailing space lets edits1 propose word breaks as well as letter edits.
alphabet = 'abcdefghijklmnopqrstuvwxyz '

def valid(w):
  # A candidate is valid if every space-separated component is a known word.
  return all(s in NWORDS for s in w.split())

def score(w):
  # Independence assumption: sum of per-word log probabilities,
  # minus a small penalty for each inserted space.
  return sum(NWORDS[s] for s in w.split()) - w.count(' ')

def edits1(word):
  splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
  deletes    = [a + b[1:] for a, b in splits if b]
  transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
  replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
  inserts    = [a + c + b     for a, b in splits for c in alphabet]
  return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
  return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if valid(e2))

def known(words): return set(w for w in words if valid(w))

def correct(word):
  candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
  return max(candidates, key=score)

def t(w):
  print(w, correct(w))

t('expertsexchange')
t('spel checker')
t('spellchecker')
t('spelchecker')
t('baseball')
t('hewent')
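
For the bigram statistics suggested above, here is a rough sketch of what a bigram-aware scorer could look like, reusing words(), NWORDS, collections and math from the listing; train_bigrams, bigram_score and the smoothing constants are my own guesses, not part of the original corrector:

def train_bigrams(tokens):
  # Adjacent-word counts from the same corpus, smoothed like train().
  counts = collections.defaultdict(lambda: 1.0)
  for a, b in zip(tokens, tokens[1:]):
    counts[(a, b)] += 1.0
  tot = float(sum(counts.values()))
  model = collections.defaultdict(lambda: math.log(.1 / tot))
  for pair in counts:
    model[pair] = math.log(counts[pair] / tot)
  return model

BIGRAMS = train_bigrams(words(open('big.txt').read()))

def bigram_score(w):
  # Prefer splits whose adjacent words actually occur together in the corpus;
  # single words fall back to the unigram model. Use it as
  # max(candidates, key=bigram_score) inside correct().
  parts = w.split()
  if len(parts) == 1:
    return NWORDS[w]
  return sum(BIGRAMS[(a, b)] for a, b in zip(parts, parts[1:])) - w.count(' ')
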
甜宝宝 2025-01-12 15:30:36

This problem is very similar to the problem of compound splitting as applied to German or Dutch, but it also shows up in noisy English data. See Monz & De Rijke for a very simple algorithm (which I think could be implemented as a finite-state transducer for efficiency), and Google for "compound splitting" and "decompounding".
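
A minimal sketch of dictionary-driven decompounding in that spirit (a greedy recursive segmenter, not a faithful implementation of Monz & De Rijke's method); LEXICON is a toy stand-in for a real word list:

LEXICON = {'spell', 'checker', 'he', 'went', 'stack', 'overflow'}

def decompound(word, lexicon=LEXICON):
  # Return one segmentation of word into lexicon words, or None.
  if word in lexicon:
    return [word]
  for i in range(len(word) - 1, 0, -1):  # try longer prefixes first
    prefix, rest = word[:i], word[i:]
    if prefix in lexicon:
      tail = decompound(rest, lexicon)
      if tail is not None:
        return [prefix] + tail
  return None

print(decompound('stackoverflow'))  # ['stack', 'overflow']
print(decompound('hewent'))         # ['he', 'went']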
