
Context-specific spelling engine

Posted 2024-07-24 08:53:30 · 158 words · 11 views · 0 comments

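I'm sure quite a few of you have seen the Google Wave demonstration. I'm particularly curious about the spell-checking technology. How revolutionary is a spell checker that makes its suggestions by working out where a word sits in the context of the sentence?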

I haven't seen this technique before, but are there examples of it elsewhere?
If so, are there code examples and literature on how it works?

千年*琉璃梦 2024-07-31 08:53:30

My 2 cents. Given that translate.google.com is a statistical machine translation engine, and given "The Unreasonable Effectiveness of Data" by A. Halevy, P. Norvig (Director of Research at Google) and F. Pereira, my bet is that this is a statistically driven spell checker.

How it could work: you collect a very large corpus of the language you want to spell check. You store this corpus as phrase-tables in suitable data structures (suffix arrays, for example, if you have to count subsets of the n-grams) that keep track of the count of each n-gram, and so of its estimated probability.

For example, if your corpus consists only of:

I had bean soup last diner.

From this entry, you will generate the following bi-grams (sequences of 2 consecutive words):

I had, had bean, bean soup, soup last, last diner

and the tri-grams (sequences of 3 consecutive words):

I had bean, had bean soup, bean soup last, soup last diner
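As a rough illustration of the counting step, here is a minimal Python sketch that rebuilds the two lists above. The `tokenize` and `ngrams` helpers are names I made up, the whitespace tokenizer is a deliberate simplification, and a plain `Counter` stands in for the suffix array or whatever structure a real system would use.

```python
from collections import Counter

def tokenize(sentence):
    # Naive tokenizer (an assumption): split on spaces, strip edge punctuation.
    return [word.strip(".,!?") for word in sentence.split()]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ["I had bean soup last diner."]

# The "phrase-table" here is just a mapping from n-gram to its count.
counts = Counter()
for sentence in corpus:
    tokens = tokenize(sentence)
    for n in (2, 3):
        counts.update(ngrams(tokens, n))

bigrams = [" ".join(gram) for gram in counts if len(gram) == 2]
trigrams = [" ".join(gram) for gram in counts if len(gram) == 3]
print(bigrams)
print(trigrams)
```

With a one-sentence corpus every count is 1, so nothing useful can be estimated yet; the counts only become meaningful on a very large corpus.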

But they will be pruned by tests of statistical relevance. For example, we can assume that the tri-gram

I had bean

will disappear from the phrase-table.
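I don't know which relevance test would actually be used, so the sketch below swaps in the simplest possible one: drop every n-gram seen fewer than `min_count` times. The `prune` function, the threshold and the counts are all hypothetical, as if they came from a much larger corpus.

```python
from collections import Counter

def prune(counts, min_count=2):
    """Keep only the n-grams seen at least min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Hypothetical counts, not real data.
counts = Counter({
    ("had", "bean", "soup"): 12,
    ("bean", "soup", "last"): 3,
    ("I", "had", "bean"): 1,
})

table = prune(counts, min_count=2)
print(("I", "had", "bean") in table)
```

A real system would more likely use a proper significance test or a smoothing scheme, but the effect on the table is the same kind of thing: rare n-grams go away.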

Now, spell checking only has to look in these big phrase-tables and check the "probabilities". (You need a good infrastructure to store these phrase-tables in an efficient data structure and in RAM. Google has it for translate.google.com, so why not for this? It's easier than statistical machine translation.)

For example, you type

I had been soup

and in the phrase-table there is a

had bean soup

tri-gram with a much higher probability than what you just typed! Indeed, you only need to change one word (this is a "not so distant" tri-gram) to get a tri-gram with a much higher probability. There should be an evaluation function that handles the trade-off between distance and probability. This distance could even be measured in characters: we are doing spell checking, not machine translation.
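To show what such an evaluation function might look like, here is a small Python sketch. It is only one possible choice, not a known implementation: the score is the log-probability of a candidate tri-gram minus a penalty per character of Levenshtein distance from what was typed. The `suggest` function, the `penalty` and `floor` parameters, and the table counts are all assumptions of mine.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca by cb
        prev = cur
    return prev[-1]

def suggest(typed, table, penalty=2.0, floor=1e-9):
    """Pick the tri-gram with the best log-probability minus distance penalty.

    Assumes a non-empty table; unseen tri-grams get the floor probability.
    """
    total = sum(table.values())
    typed_text = " ".join(typed)

    def score(candidate):
        prob = table.get(candidate, 0) / total or floor
        return math.log(prob) - penalty * edit_distance(typed_text, " ".join(candidate))

    # The typed tri-gram competes too, so likely input is left alone.
    return max(set(table) | {typed}, key=score)

# Hypothetical phrase-table counts.
table = {
    ("had", "bean", "soup"): 40,
    ("had", "been", "there"): 55,
    ("bean", "soup", "last"): 5,
}
print(suggest(("had", "been", "soup"), table))
```

Scanning the whole table is only acceptable for a toy example; at real scale you would first generate the few candidates within a small edit distance and score only those. Raising `penalty` makes the checker more conservative, lowering it makes it more eager to rewrite.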

This is only my hypothetical opinion. ;)
