Spelling suggestions for conjoined words

Posted 2024-08-11 05:30:14, 335 words, 6 views, 0 comments

I am working on implementing a spell check function for a web-based WYSIWYG editor. I am currently using the Damerau-Levenshtein distance algorithm to produce a list of spelling suggestions. This is all working out nicely, but I am curious as to how I might improve the functionality.

Specifically, my implementation does not currently handle conjoined words. For instance, I would like to be able to detect "areyou" and suggest "are you" instead. I think I can do this by breaking the potentially conjoined word apart at likely-looking points and testing both halves. Since all English words must have at least one vowel, I think I can look for vowels to help me decide where to break words apart.

The Damerau-Levenshtein distance algorithm was so useful; it is clear that others have put a lot more thought into this than I have. Is there a similarly clever algorithm that I should consider for detecting conjoined words, or am I on the right track already?

Comments (3)

对你而言 2024-08-18 05:30:14

I imagine the candidate conjoined word will not be longer than forty (40) characters or so; most of the time it will be less than ten (10).

Considering the small size, what about this pseudocode?

if is_spelled_wrong(word):
    n = len(word)
    list_suggestions = []
    for i in range(1, n):  # every split point between two characters
        word_a = word[:i]  # Python slice: the first i characters
        word_b = word[i:]  # start at i (not i + 1) so no character is dropped
        if not is_spelled_wrong(word_a) and not is_spelled_wrong(word_b):
            list_suggestions.append((word_a, word_b))

In other words, just scan the string for every possible split point. There are only a few of them: in the case of "areyou", you would loop five (5) times.
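The scan can also be written as a small self-contained Python function. This is only a minimal sketch: `split_suggestions` and the toy `WORDS` set are hypothetical names, and a plain set-membership test stands in for `is_spelled_wrong`.

```python
def split_suggestions(word, dictionary):
    """Return every (left, right) split of `word` whose halves are both known words."""
    suggestions = []
    for i in range(1, len(word)):  # len(word) - 1 split points
        left, right = word[:i], word[i:]
        if left in dictionary and right in dictionary:
            suggestions.append((left, right))
    return suggestions


# Hypothetical toy dictionary; a real checker would load its full word list.
WORDS = {"a", "are", "you", "no", "where", "now", "here"}

print(split_suggestions("areyou", WORDS))   # [('are', 'you')]
print(split_suggestions("nowhere", WORDS))  # [('no', 'where'), ('now', 'here')]
```

As the second call shows, a string can split in more than one valid way, so the result is a list of candidates rather than a single answer.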

自由范儿 2024-08-18 05:30:14

Since you are already scanning the whole dictionary for every word, it wouldn't be terribly inefficient to add common pairs of words to the dictionary. Alternatively, you can split the input (the possibly conjoined word) into two words at every possible position and then look for dictionary words near each half. It isn't as slow as it sounds: the intermediate Damerau-Levenshtein (DL) results computed for a word also give you the results for its prefixes.
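One way to read the point about intermediate results: in the usual dynamic-programming table, row i already holds the distance between the first i characters of the input and the dictionary word, so one table covers every left-hand split. The sketch below assumes the optimal string alignment variant (the restricted form of Damerau-Levenshtein); `prefix_distances` is a hypothetical name, not something from the original answer.

```python
def prefix_distances(text, word):
    """Return [distance(text[:i], word) for i in 0..len(text)] from one DP table.

    Uses the optimal string alignment (restricted Damerau-Levenshtein) distance.
    """
    rows, cols = len(text) + 1, len(word) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if text[i - 1] == word[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and text[i - 1] == word[j - 2]
                    and text[i - 2] == word[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    # The last column of row i is the distance from text[:i] to the whole word.
    return [row[-1] for row in d]


print(prefix_distances("areyou", "are"))  # [3, 2, 1, 0, 1, 2, 3]
```

The zero at index 3 says that the prefix "are" matches the dictionary word exactly, which marks index 3 as a promising split point.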

失与倦" 2024-08-18 05:30:14

Check out this excellent article on writing a spelling checker. Using that technique, you have two options: either add every pair of words (or every likely pair) to the dictionary, with the separated words as the suggested correction, or try every possible split point and do a standard dictionary lookup to see whether both halves are valid words.
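The first option might look roughly like this in Python. It is a sketch under assumptions: `build_pair_entries` and the `COMMON_PAIRS` list are made up for illustration, and a real list of likely pairs would have to come from your own data.

```python
def build_pair_entries(pairs):
    """Map each joined pair (e.g. 'areyou') to its separated form ('are you')."""
    return {left + right: f"{left} {right}" for left, right in pairs}


# Hypothetical list of likely word pairs.
COMMON_PAIRS = [("are", "you"), ("thank", "you"), ("of", "the")]

pair_entries = build_pair_entries(COMMON_PAIRS)
print(pair_entries["areyou"])  # are you
```

Because the joined forms are ordinary dictionary keys, the existing distance-based lookup can match against them like any other word and return the separated form as the suggestion.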
