Data on the frequency of edit operations needed to correct misspelled words
Does anybody know of any data on how often people make each type of mistake when they misspell a word? I'm not referring to the words themselves, but to the errors made by the typist. For example, I personally make transposition errors most often, followed by deletion errors (that is, leaving out a letter I should have typed), then substitution errors and, lastly, insertion errors. However, it would not surprise me to find out that typing a wrong letter (a substitution error, e.g., xat instead of cat) is more frequent than leaving a letter out.
My purpose is to make the best possible guess at correcting a word when I only have the user's original input. The idea is that if one type of error is more frequent than the others, then a correction that undoes that type of error is more likely to be the right one. I don't object to using a database of commonly misspelt words, but I prefer an algorithmic solution to depending on a corpus, especially if it might be faster.
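To make the idea concrete, here is a minimal sketch of an edit distance in which each error type has its own cost, so candidates reachable through a "likely" error rank higher. The names `ERROR_COSTS`, `typo_cost` and `rank_candidates` are hypothetical, and the cost values are placeholders that only mirror the personal ordering described above (transposition, deletion, substitution, insertion); they are not measured frequencies. The recurrence is the restricted (optimal string alignment) form of Damerau-Levenshtein distance.

```python
# Hypothetical costs: a lower cost means "a more plausible typing error".
# These numbers are placeholders reflecting one person's guess, not data.
ERROR_COSTS = {
    "transposition": 0.6,
    "deletion": 0.8,
    "substitution": 1.0,
    "insertion": 1.2,
}


def typo_cost(intended: str, typed: str, costs=ERROR_COSTS) -> float:
    """Cheapest sequence of typist errors that turns `intended` into `typed`."""
    n, m = len(intended), len(typed)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + costs["deletion"]    # typist dropped a letter
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + costs["insertion"]   # typist added a letter
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = intended[i - 1] == typed[j - 1]
            best = min(
                d[i - 1][j] + costs["deletion"],
                d[i][j - 1] + costs["insertion"],
                d[i - 1][j - 1] + (0.0 if same else costs["substitution"]),
            )
            # Two adjacent letters typed in swapped order.
            if (i > 1 and j > 1
                    and intended[i - 1] == typed[j - 2]
                    and intended[i - 2] == typed[j - 1]):
                best = min(best, d[i - 2][j - 2] + costs["transposition"])
            d[i][j] = best
    return d[n][m]


def rank_candidates(typed: str, candidates):
    """Sort candidate corrections from most to least plausible."""
    return sorted(candidates, key=lambda w: (typo_cost(w, typed), w))


print(rank_candidates("teh", ["te", "ten", "tech", "the"]))
```

With these placeholder costs, "the" (one transposition) ranks ahead of "tech" (one deletion), "ten" (one substitution) and "te" (one insertion). Real frequency data, if it exists, would simply replace the numbers in `ERROR_COSTS`.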
Comments (1)
You could try something like calculating the Levenshtein distance between the mistyped word and the words in a dictionary. I'm not sure that's what you want, though.
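As a rough illustration of that suggestion, the sketch below computes the plain Levenshtein distance (every insertion, deletion and substitution costs 1) and picks the dictionary words closest to the typed input. The function names `levenshtein` and `closest_words` and the tiny word list are made up for the example; a real dictionary would be much larger and would probably need a faster lookup than a full scan.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def closest_words(typed: str, dictionary, limit: int = 3):
    """Return up to `limit` dictionary words, nearest first (ties alphabetical)."""
    return sorted(dictionary, key=lambda w: (levenshtein(typed, w), w))[:limit]


# Illustrative word list only.
print(closest_words("xat", ["cat", "dog", "chat"], limit=2))
```

Because all operations cost the same here, this ranking cannot prefer one kind of error over another, which is the part the question is really asking about.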