Creating a suggested-word algorithm

Published 2024-11-03 00:51:41 · 310 words · 4 views · 0 comments

I'm designing a cool spell checker (I know, I know, modern browsers already have this). Anyway, I'm wondering what kind of effort it would take to develop a fairly simple but decent suggest-word algorithm.

My idea is to first look at the misspelled word's characters and count how many of them match each word in the dictionary (which sounds resource-intensive), then pick the top 5 matches (so if the misspelled word shares the most characters with 7 dictionary words, it will randomly display 5 of those as suggested spellings).
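As a rough illustration of that idea, the character-overlap scoring could be sketched like this (the `char_overlap` and `suggest` helpers are hypothetical names, not an existing library):

```python
from collections import Counter

def char_overlap(misspelled, candidate):
    """Count how many characters the two words share (with multiplicity)."""
    a, b = Counter(misspelled), Counter(candidate)
    return sum((a & b).values())

def suggest(misspelled, dictionary, k=5):
    """Return up to k dictionary words with the highest character overlap."""
    ranked = sorted(dictionary,
                    key=lambda w: char_overlap(misspelled, w),
                    reverse=True)
    return ranked[:k]

print(suggest("njce", ["nice", "mice", "juice", "ice", "dog"]))
# → ['nice', 'juice', 'mice', 'ice', 'dog']
```

Note this scoring ignores character order entirely ("njce" and "juice" both score 3 against it), which is part of why it stays simple but crude.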

Obviously, to get more advanced, we would look at "common words" and keep a dictionary file ranked by how frequently each word is used in English. I think that may be taking it a bit overboard, though.
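If you did go down the frequency route, re-ranking equally-scored candidates could be as simple as a lookup table. The ranks below are made up for illustration; a real frequency file would supply them:

```python
# Hypothetical frequency ranks (lower = more common in English).
FREQ_RANK = {"nice": 120, "juice": 2100, "mice": 3400}

def rerank(candidates, freq=FREQ_RANK):
    """Order candidates so more common words come first;
    unknown words sort last."""
    return sorted(candidates, key=lambda w: freq.get(w, float("inf")))

print(rerank(["juice", "mice", "nice"]))  # → ['nice', 'juice', 'mice']
```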

What do you think? Anyone have ideas for this?

Comments (2)

疧_╮線 2024-11-10 00:51:41

First of all, you have to consider the complexity of finding the words "nearer" to the misspelled word. I see that you are using a dictionary, perhaps a hash table, but that may not be enough. The better and cooler solution here is a trie data structure. Finding these so-called nearer words takes linear time, and it is very easy to exhaust the tree.

A small example:

Take the word "njce". This is a level-1 example where one character is clearly misspelled; the obvious expected suggestion would be "nice". The first step, obviously, is to check whether the word is present in the dictionary. Using the trie's search function, this can be done in time linear in the word's length, comparable to a hash-table lookup. The cooler part is finding suggestions. You would have to try all the words obtained by substituting 'a' to 'z' at the first position: "ajce", "bjce", "cjce", up to "zjce". Finding occurrences of this type is again linear in the character count. You should not get carried away multiplying this number by 26 times the word length, since the trie's branching diminishes quickly as the prefix grows. Coming back to the problem: once that search finds no result, you move on to the next character. Now you would search "nace", "nbce", "ncce", up to "nzce". In fact, you won't have to explore all the combinations, because the trie itself will not contain most of the intermediate characters; perhaps it only has "na", "ni", "ne", "no", "nu", and the search space becomes insanely simple. The same holds for positions further along. You could develop this concept further with second- and third-order matches (two or three substituted characters). Hope this helps.
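The substitution search described above could be sketched roughly like this, assuming we allow at most one substituted character and, as the answer notes, only branch into children the trie actually contains:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    """Add a word to the trie, one character per level."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def one_substitution_matches(root, word):
    """Find dictionary words of the same length reachable by changing
    at most one character. At each position we only descend into
    children that exist, so the search space shrinks with the prefix."""
    results = []

    def walk(node, i, prefix, substituted):
        if i == len(word):
            if node.is_word:
                results.append(prefix)
            return
        for ch, child in node.children.items():
            if ch == word[i]:
                walk(child, i + 1, prefix + ch, substituted)
            elif not substituted:          # spend our one substitution
                walk(child, i + 1, prefix + ch, True)

    walk(root, 0, "", False)
    return results

root = TrieNode()
for w in ["nice", "mice", "ice", "note"]:
    insert(root, w)

print(one_substitution_matches(root, "njce"))  # → ['nice']
```

Extending to second- and third-order matches would mean replacing the boolean `substituted` flag with a remaining-edits counter.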

吾家有女初长成 2024-11-10 00:51:41

I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.

Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
