How to find common misspellings of a given word using aspell or other tools
For a given word, I'd like to find the n closest misspellings. I was wondering whether an open-source spell checker like aspell would be useful for this, or whether you have other suggestions.
For example: 'health'
would give me: ealth, halth, heallth, healf, ...
Spelling correction tools take misspelled words and offer possible correctly spelled alternatives. You seem to want to go in the other direction.
Going from a correctly spelled word to a set of possible misspellings could probably be performed by applying a set of mutation heuristics to common words. These heuristics might do things like:
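As a minimal sketch of what such mutation heuristics could look like, the Python snippet below applies three single-edit mutations: dropping a letter, doubling a letter, and swapping two adjacent letters. These three mutations and the function name `typo_candidates` are assumptions chosen to match the examples in the question (ealth, halth, heallth); they are illustrative, not a definitive set.

```python
def typo_candidates(word):
    """Return single-edit variants of `word` (illustrative heuristics only)."""
    candidates = set()
    for i in range(len(word)):
        # Drop the letter at position i: "health" -> "ealth", "halth", ...
        candidates.add(word[:i] + word[i + 1:])
        # Double the letter at position i: "health" -> "heallth", ...
        candidates.add(word[:i + 1] + word[i] + word[i + 1:])
    for i in range(len(word) - 1):
        # Swap adjacent letters: "health" -> "haelth", ...
        candidates.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    # Swapping two identical letters reproduces the original word; drop it.
    candidates.discard(word)
    return sorted(candidates)


print(typo_candidates("health"))
```

A sketch like this produces every candidate with equal weight. Picking the n closest ones would still need some ranking, for example by how often each variant actually occurs in real text.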
Going from a correctly spelled word to a set of common misspellings is really hard. Probably the only reliable way to do this would be to instrument a spelling checker package used by a large community of users, record the actual spelling corrections made using the spelling checker, and aggregate the results. That is probably (!) beyond the scope of your project.
On revisiting my answer, I think I've missed something.
My heuristics above are mostly for typing errors rather than misspellings. A typing error is where the user knows the correct spelling but mistypes the word. A misspelling is where the person doesn't know the correct spelling of a word and relies on incorrect knowledge or intuition (i.e. a guess). Typical guesses are based on listening to what the word sounds like and then picking a spelling that (if correct) would most likely be pronounced that way.
So a good heuristic for predicting misspellings would need to be based on what the word actually sounds like when spoken. That requires a phonetic dictionary (to go from the actual word to its pronunciation) and a set of rules for generating plausible spellings for the phonetic word. That's more complicated than simple heuristics for typing errors.
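To make the phonetic idea concrete, here is a minimal Python sketch under loud assumptions: `PRONUNCIATIONS` is a hypothetical one-entry stand-in for a real phonetic dictionary (using ARPAbet-style labels), and `SPELLINGS` is a hand-written, hypothetical table of plausible spellings per sound. Neither comes from aspell or any real data set. The function looks up the pronunciation and then combines every spelling option for each sound.

```python
from itertools import product

# Hypothetical, hand-written pronunciation entry; a real system would load a
# full phonetic dictionary instead.
PRONUNCIATIONS = {"health": ["HH", "EH", "L", "TH"]}

# Hypothetical sound -> plausible spellings rules, chosen for illustration.
SPELLINGS = {
    "HH": ["h"],
    "EH": ["ea", "e"],
    "L": ["l", "ll"],
    "TH": ["th", "f"],
}


def phonetic_misspellings(word):
    """Return spellings that would plausibly be pronounced like `word`."""
    phonemes = PRONUNCIATIONS[word]
    # One list of spelling options per sound, in pronunciation order.
    options = [SPELLINGS[p] for p in phonemes]
    # Combine every choice of spelling for every sound.
    guesses = {"".join(parts) for parts in product(*options)}
    # The correct spelling is not a misspelling.
    guesses.discard(word)
    return sorted(guesses)


print(phonetic_misspellings("health"))
```

With these toy tables the output includes guesses such as helth and healf. The hard part in practice is the rule table itself, which is why this approach is more work than the typing-error heuristics.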