Context-specific spelling engine
I'm sure more than a few of you will have seen the Google Wave demonstration. I was wondering about the spell checking technology specifically. How revolutionary is a spell checker that makes its suggestions by figuring out where a word appears contextually within a sentence?
I haven't seen this technique before, but are there examples of it elsewhere?
And if so, are there code examples and literature on how it works?
Comments (4)
My 2 cents. Given that translate.google.com is a statistical machine translation engine, and given "The Unreasonable Effectiveness of Data" by A. Halevy, P. Norvig (Director of Research at Google) and F. Pereira, my assumption (my bet) is that this is a statistically driven spell checker.
How it could work: you collect a very large corpus of the language you want to spell check. You store this corpus as phrase tables in suitable data structures (suffix arrays, for example, if you have to count subsets of the n-grams) that keep track of the count of each n-gram, and so of its estimated probability.
For example, if your corpus consists only of:
From this entry, you will generate the following bi-grams (sequences of 2 words):
and the tri-grams (sequences of 3 words):
But they will be pruned by tests of statistical relevance. For example, we can assume that the tri-gram
will disappear from the phrase table.
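The counting and pruning steps described above can be sketched as follows. This is only an illustration: the toy corpus is hypothetical (it is not the corpus from this answer), the names `ngrams` and `build_phrase_table` are made up for the sketch, and a plain minimum-count threshold stands in for a real test of statistical relevance.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive words in tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_phrase_table(sentences, n, min_count=2):
    """Count n-grams over all sentences, then prune the rare ones.

    Assumption: dropping n-grams seen fewer than min_count times is a
    crude stand-in for a proper statistical relevance test.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Hypothetical toy corpus, for illustration only.
corpus = [
    "i want a piece of cake",
    "i want a cup of tea",
    "they want a piece of cake",
]

bigrams = build_phrase_table(corpus, 2)
trigrams = build_phrase_table(corpus, 3)
```

With this corpus, a tri-gram that occurs only once, such as ("a", "cup", "of"), is pruned, while ("i", "want", "a") survives because it occurs twice. A real system would turn the surviving counts into probabilities and store them in something far more compact than a dictionary.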
Now, spell checking only has to look in this big phrase table and check the "probabilities". (You need a good infrastructure to store the phrase table in an efficient data structure and in RAM. Google has it for translate.google.com, so why not for this? It's easier than statistical machine translation.)
Example: you type
and in the phrase table there is a
tri-gram with a much higher probability than what you just typed! Indeed, you only need to change one word (this is a "not so distant" tri-gram) to get a tri-gram with a much higher probability. There should be an evaluation function that handles the trade-off between distance and probability. This distance could even be calculated in terms of characters: we are doing spell checking, not machine translation.
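One way such an evaluation function might look is sketched below. Everything here is an assumption made for illustration: the tri-gram counts are invented, the score (log of the count minus a penalty per character edit) is just one possible trade-off, and `suggest` only considers tri-grams that differ from the typed one in exactly one word.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def suggest(typed, table, penalty=1.0):
    """Return the best-scoring tri-gram: typed itself or a one-word change.

    Assumed score: log(count + 1) - penalty * character edit distance.
    """
    def score(gram, dist):
        return math.log(table.get(gram, 0) + 1) - penalty * dist

    best, best_score = typed, score(typed, 0)
    for gram in table:
        diffs = [(t, g) for t, g in zip(typed, gram) if t != g]
        if len(diffs) != 1:
            continue  # only "not so distant" tri-grams: one word changed
        t, g = diffs[0]
        s = score(gram, edit_distance(t, g))
        if s > best_score:
            best, best_score = gram, s
    return best

# Hypothetical counts, for illustration only.
table = {
    ("piece", "of", "cake"): 500,
    ("peace", "of", "mind"): 300,
    ("piece", "of", "paper"): 200,
}

suggestion = suggest(("peace", "of", "cake"), table)
```

Here every word of the typed tri-gram is spelled correctly, yet the tri-gram as a whole is unseen, so the frequent and nearby ("piece", "of", "cake") wins. Raising `penalty` makes the checker more conservative about changing what was typed.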
This is only my hypothetical opinion. ;)
You should also watch an official video by Casey Whitelaw of the Google Wave team that describes the techniques used: http://www.youtube.com/watch?v=Sx3Fpw0XCXk
You can learn all about topics like this by diving into natural language processing. You can even go as far as making a statistical guess at which word will come next after a given string of words.
If you are interested in such a topic, I highly suggest using NLTK (the Natural Language Toolkit), which is written entirely in Python. It is a very expansive work, with many tools and pretty good documentation.
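As a rough idea of what "guessing the next word" means, here is a minimal sketch in plain Python, so it runs without installing anything; NLTK offers far more complete tooling for the same job. The training sentences and the function names `train_next_word` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def train_next_word(sentences):
    """For each word, count which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of word, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical training sentences, for illustration only.
sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]

model = train_next_word(sentences)
guess = predict_next(model, "the")
```

This looks only one word back. Conditioning on the previous two or three words gives better guesses but needs much more training text.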
There are a lot of papers on this subject. Here are some good resources.
This one doesn't use context sensitivity, but it's a good base to build from:
http://norvig.com/spell-correct.html
This is probably a good and easy-to-understand view of a more powerful spell checker:
http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Cucerzan.pdf
From here you can dive deep into the particulars. I'd recommend using Google Scholar to look up the references in the paper above and to search for "spelling correction".
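To give a feel for the non-contextual baseline, here is a condensed sketch in the spirit of the first link: generate every string one edit away from the typed word and keep the known word with the highest count. It is not the article's code, and the word counts are hypothetical.

```python
from collections import Counter
import string

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing better found, leave the word alone
    return max(candidates, key=lambda w: counts[w])

# Hypothetical word counts, for illustration only.
counts = Counter({"spelling": 40, "spilling": 5, "the": 1000})

fixed = correct("speling", counts)
```

Because each word is judged on its own, this cannot catch a correctly spelled word used in the wrong place, which is exactly the gap the context-sensitive approaches try to close.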