Building a natural language model that fixes misspellings

Posted on 2024-08-20 13:15:14

Which books explain how to build a natural language parsing program like this:

input: I got to TALL you
output: I got to TELL you

input: Big RAT box
output: Big RED box

in: hoo un thum zend three
out: one thousand three

It must have a language model that can predict which words are misspelled!

What are the best books on how to build such a tool?

P.S. Are there any free web services for spell-checking? From Google, maybe?

Comments (5)

谁与争疯 2024-08-27 13:15:14

Peter Norvig has written a terrific spell checker. Maybe that can help you.
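For readers who want a feel for that kind of checker, here is a minimal sketch in the same spirit: generate every string one edit away from the input and keep the known word with the highest corpus frequency. The tiny `CORPUS` string and the names `edits1` and `correct` are assumptions made for this illustration, not the original code. A per-word checker like this only fixes words that are not in the dictionary, so it cannot repair valid-word errors such as TALL for TELL; those need context.

```python
import re
from collections import Counter

# Toy corpus standing in for a large text collection (an assumption for this sketch).
CORPUS = "i got to tell you about the big red box i have to tell you one thing"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the word unchanged."""
    word = word.lower()
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS]
    if not candidates:
        return word
    return max(candidates, key=WORDS.get)
```

With this toy corpus, `correct("thnig")` returns `"thing"` and `correct("tel")` returns `"tell"`. A real checker would count words over a much larger corpus.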

只涨不跌 2024-08-27 13:15:14

You have at least three options:

  1. You can write a program which understands the language (i.e. what a word means). This is still an open research topic. Expect the first results once you can buy a computer fast enough to run such a program (probably in about 10 years, when computers are 1000 times faster than today).

  2. Use a huge corpus (text documents) to train a Hidden Markov Model.

  3. Use a huge corpus and generate statistics about n-grams (4-grams, say), i.e. how often each tuple of N words appears. I don't have a link handy for this, but the idea is that some words always appear in the context of other words. So when you split your text into 4-grams, look them up in your database, and can't find one, chances are that something is wrong with the current tuple. The next step is to find all possible matches (other 4-grams with a small Soundex or similar distance to the current one) and try the one with the highest frequency.

    Google has this data for quite a few languages, and you might find more about this in Google Labs.
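The n-gram idea in option 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses 3-grams instead of 4-grams so that a five-line toy corpus is enough, it measures similarity with plain Levenshtein distance rather than Soundex, and `CORPUS`, `fix_sentence` and `max_dist` are made-up names for this example.

```python
from collections import Counter

N = 3  # the answer suggests 4-grams; 3 keeps this toy corpus small (assumption)

# Toy corpus standing in for a huge text collection (assumption).
CORPUS = [
    "i got to tell you",
    "i have to tell you something",
    "a big red box",
    "the big red box is tall",
    "a rat ran away",
]


def ngrams(words, n=N):
    """Return all consecutive n-word tuples in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


COUNTS = Counter(g for line in CORPUS for g in ngrams(line.split()))
VOCAB = {w for line in CORPUS for w in line.split()}


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fix_sentence(sentence, max_dist=2):
    """Replace one word in each unseen n-gram with a similar word that makes it a seen one."""
    words = sentence.lower().split()
    for i in range(len(words) - N + 1):
        window = tuple(words[i:i + N])
        if COUNTS[window] > 0:
            continue  # this n-gram was seen in the corpus, leave it alone
        best, best_count = None, 0
        for pos in range(N):
            for cand in VOCAB:
                if cand == window[pos] or edit_distance(cand, window[pos]) > max_dist:
                    continue
                trial = window[:pos] + (cand,) + window[pos + 1:]
                if COUNTS[trial] > best_count:
                    best, best_count = trial, COUNTS[trial]
        if best:
            words[i:i + N] = best
    return " ".join(words)
```

With this corpus, `fix_sentence("I got to TALL you")` gives `"i got to tell you"` and `fix_sentence("Big RAT box")` gives `"big red box"`, even though "tall" and "rat" are valid words in the vocabulary. Only the surrounding words reveal the error, which is the point of the n-gram approach.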

[EDIT] After some googling, I finally found the link: On this page, you can buy English 1- to 5-grams which Google collected over the whole Internet on 6 DVDs.

Googling for "google spelling statistics n-grams" will also turn up some interesting links.

云胡 2024-08-27 13:15:14

Soundex (wiki) is one option.
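As a rough sketch, a common variant of the Soundex rules (American Soundex) can be written in a few lines: keep the first letter, map the remaining consonants to digits, collapse runs of the same digit, and pad to four characters. The function name `soundex` and the restriction to ASCII letters are choices made for this illustration.

```python
# Digit assigned to each consonant group; vowels, y, h and w get no digit.
CODES = {}
for digit, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in group:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of word (ASCII letters only)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and y map to "" and do break a run
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]
```

Words that sound alike get the same code: `soundex("tall")` and `soundex("tell")` are both `"T400"`, and `soundex("rat")` and `soundex("red")` are both `"R300"`. That makes the code usable as a cheap filter for candidate corrections, though it says nothing about which candidate fits the context.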

-柠檬树下少年和吉他 2024-08-27 13:15:14

There are quite a few Java libraries for natural language processing that would help you implement a spelling corrector. But you asked about a book. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze looks like a good option. The first author is a Stanford professor who leads a group that does natural language processing and develops Java libraries and NLP resources that many people use.

淑女气质 2024-08-27 13:15:14

At Dev Days London, Michael Sparks presented a Python script written for exactly that. It was surprisingly simple! See if you can find it on Google. Maybe somebody here has the link.
