Where can I find some real world typo statistics?
I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:
- Typos - "Helllo" instead of "Hello" / "Satudray" instead of "Saturday", etc.
- Spelling - "Shikago" instead of "Chicago"
I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).
I want to focus on the Damerau-Levenshtein (or simply edit distance). The textbook implementations always use '1' for the weight of deletions, insertions, substitutions and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / "real-world probabilities".
Examples:
- I'm sure the likelihood of "Helllo" ("Hello") is greater than "Helzlo", yet they are both 1 edit distance away.
- "Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
- Unicode transliterations: What is the "real" distance between "München" and "Munchen"?
What should the "real world" weights be for deletions, insertions, substitutions, and transpositions?
Even Norvig's very cool spell corrector uses non-weighted edit distance.
BTW - I'm sure the weights need to be functions and not simple floats (per the above examples)...
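To illustrate what I mean by "weights as functions", here is a stripped-down Levenshtein-style sketch (transpositions omitted for brevity) where the substitution cost depends on the character pair. The QWERTY adjacency map and the 0.5 / 1.0 costs are made-up placeholders, exactly the numbers I don't know how to choose:

```python
# Sketch: edit distance where the substitution cost is a *function* of the pair.
# The adjacency map and the 0.5 / 1.0 costs are made-up placeholders.

# Partial QWERTY adjacency map (illustration only, not complete).
QWERTY_NEIGHBOURS = {
    'g': set('fhtybv'),
    'h': set('gjyunb'),
    'q': set('wa'),
    'e': set('wrds'),
    'r': set('etdf'),
}

def sub_cost(a, b):
    """Substitution cost as a function of the two characters."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 0.0
    if b in QWERTY_NEIGHBOURS.get(a, set()):
        return 0.5   # adjacent keys: a "cheap" typo
    return 1.0

def weighted_edit_distance(s, t, ins_cost=1.0, del_cost=1.0):
    """Levenshtein DP with per-pair substitution costs (no transpositions)."""
    prev = [j * ins_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [i * del_cost]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + del_cost,             # delete a
                           cur[j - 1] + ins_cost,          # insert b
                           prev[j - 1] + sub_cost(a, b)))  # substitute a -> b
        prev = cur
    return prev[-1]

print(weighted_edit_distance('Gello', 'Hello'))  # 0.5 - G and H are neighbours
print(weighted_edit_distance('Qello', 'Hello'))  # 1.0
```

The point is only that sub_cost (and likewise insertion/deletion costs) can be arbitrary functions; the open question is where their values should come from.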
I can adjust the algorithm, but where can I "learn" these weights? I don't have access to Google-scale data...
Should I just guess them?
EDIT - trying to answer user questions:
- My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
- I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
- Usage of the edit distance is different for each of the above word-groups. I try to "auto-correct when obvious", e.g. when the input is 1 edit distance away from only 1 word in the dictionary (a rough sketch of this rule follows this list). I have many other hand-tuned rules, e.g. a Double Metaphone fix that is applied only when it is not more than 2 edit distance away from a dictionary word of length > 4... The list of rules continues to grow as I learn from real-world input.
- "How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighting system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
- Today when there are 2 dictionary entries at the same distance from the input I try to apply various statistics to better guess which the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran).
- The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
- Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.
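As promised above, a stripped-down sketch of the "auto-correct when obvious" rule plus the statistics-based tie-break. The DICTIONARY and its priors are toy placeholders, not my real data:

```python
# Sketch of "auto-correct when obvious" plus a prior-based tie-break.
# DICTIONARY and its priors are toy placeholders, not my real data.

DICTIONARY = {            # word -> rough prior probability of being meant
    'paris': 0.9,
    'pariz': 0.01,
    'thursday': 0.6,
    'tuesday': 0.4,
}

def edit_distance(s, t):
    """Plain unweighted Levenshtein distance, just for the sketch."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (a != b)))      # substitute
        prev = cur
    return prev[-1]

def correct(token, max_dist=1):
    token = token.lower()
    candidates = [(w, edit_distance(token, w)) for w in DICTIONARY]
    best_dist = min(d for _, d in candidates)
    if best_dist > max_dist:
        return None                       # not understood: ask the user to rephrase
    tied = [w for w, d in candidates if d == best_dist]
    if len(tied) == 1:
        return tied[0]                    # "obvious": auto-correct silently
    return max(tied, key=DICTIONARY.get)  # tie: fall back on the priors

print(correct('parris'))   # 'paris'    - only one entry within 1 edit
print(correct('tursday'))  # 'thursday' - tie broken by the (made-up) priors
```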
Comments (5)
A possible source for real-world typo statistics would be Wikipedia's complete edit history:
http://download.wikimedia.org/
Also, you might be interested in AWB's RegExTypoFix:
http://en.wikipedia.org/wiki/Wikipedia:AWB/T
I would advise you to check the trigram algorithm. In my opinion it works better for finding typos than an edit-distance algorithm. It should work faster as well, and if you keep the dictionary in a Postgres database you can make use of an index.
You may also find the Stack Overflow topic about Google's "Did you mean" useful.
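To make the suggestion concrete, here is a minimal sketch of character-trigram similarity (Jaccard over trigram sets). It only approximates what Postgres's pg_trgm module computes; the real extension's padding and scoring details differ:

```python
# Minimal character-trigram similarity (Jaccard over trigram sets).
# Only an approximation of pg_trgm; padding and scoring details differ.

def trigrams(word):
    padded = '  ' + word.lower() + ' '    # pad so short words still get trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(trigram_similarity('Thursday', 'Tursday'))  # ~0.55
print(trigram_similarity('Thursday', 'Tuesday'))  # ~0.31
```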
Probability Scoring for Spelling Correction by Church and Gale might help. In that paper, the authors model typos as a noisy channel between the author and the computer. The appendix has tables for typos seen in a corpus of Associated Press publications. There is a table for each of the following kinds of typos: deletions, insertions, substitutions, and transpositions.
For example, examining the insertion table, we can see that l was incorrectly inserted after l 128 times (the highest number in that column). Using these tables, you can generate the probabilities you're looking for.
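A rough sketch of how counts from such tables could be turned into edit costs (negative log probabilities). Apart from the l-after-l count of 128 quoted above, every number here is an invented placeholder, not a figure from the paper:

```python
# Sketch: turning confusion-table counts into edit costs (negative log probs).
# Apart from the 128 "l inserted after l" count, all numbers are placeholders.
import math

# ins_counts[(x, y)] = how many times y was wrongly inserted after x
ins_counts = {('l', 'l'): 128, ('l', 'z'): 2}
# how often the context character x occurred in the corpus (placeholder)
context_counts = {'l': 10000}

def insertion_cost(prev_char, inserted_char, smoothing=0.5):
    """Cost = -log P(inserted_char inserted | prev_char), with add-k smoothing."""
    count = ins_counts.get((prev_char, inserted_char), 0) + smoothing
    total = context_counts.get(prev_char, 1) + smoothing
    return -math.log(count / total)

print(insertion_cost('l', 'l'))  # low cost: a very common slip ("Helllo")
print(insertion_cost('l', 'z'))  # high cost: a rare slip ("Helzlo")
```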
If research is your interest, I think continuing with that algorithm and trying to find decent weights would be fruitful.
I can't help you with typo stats, but I think you should also play with Python's difflib, specifically the ratio() method of SequenceMatcher. It uses an algorithm which the docs (http://docs.python.org/library/difflib.html) claim is well suited to matches that 'look right', and it may be useful to augment or test what you're doing.
For Python programmers just looking for typos it is a good place to start. One of my coworkers has used both Levenshtein edit distance and SequenceMatcher's ratio() and got much better results from ratio().
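A quick look at SequenceMatcher.ratio() on the examples from the question (values approximate):

```python
# Quick look at SequenceMatcher.ratio() on the question's examples.
from difflib import SequenceMatcher

def ratio(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(ratio('Helllo', 'Hello'))       # ~0.91
print(ratio('Satudray', 'Saturday'))  # ~0.88
print(ratio('Tursday', 'Thursday'))   # ~0.93
print(ratio('Tursday', 'Tuesday'))    # ~0.86
```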
Some questions for you, to help you determine whether you should be asking your "where do I find real-world weights" question:
Have you actually measured the effectiveness of the uniform weighting implementation? How?
How many different "internal objects" do you have -- i.e. what is the size of your dictionary?
How are you actually using the edit distance e.g. John/Joan, Marmaduke/Marmeduke, Featherstonehaugh/Featherstonhaugh: is that "all 1 error" or is it 25% / 11.1% / 5.9% difference? What threshold are you using?
How many pairs of dictionary entries are within your threshold (e.g. John vs Joan, Joan vs Juan, etc)? If you introduced a fancy weighting system, how many pairs of dictionary entries would migrate (a) from inside the threshold to outside (b) vice versa?
What do you do if both John and Juan are in your dictionary and the user types Joan?
What are the penalties/costs of (1) choosing the wrong dictionary word (not the one that the user meant) (2) failing to recognise the user's input?
Will introducing a complicated weighting system actually reduce the probabilities of the above two error types by sufficient margin to make the complication and slower speed worthwhile?
BTW, how do you know what keyboard the user was using?
Update:
"""My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance)."""
Yes, Thursday -> Tursday by omitting an "h", but Tuesday -> Tursday by substituting "r" for "e". E and R are next to each other on qwERty and azERty keyboards. Every "real person" can easily guess that Thursday is more likely than Tuesday. Even if statistics as well as guesses point to Thursday being more likely than Tuesday (perhaps omitting "h" will cost 0.5 and e->r will cost 0.75), will the difference (perhaps 0.25) be significant enough to always pick Thursday? Can/will your system ask "Did you mean Tuesday?", or does/will it just plough ahead with Thursday?
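To put that last question in code form, a tiny sketch of a margin-based decision, using the hypothetical 0.5 / 0.75 costs from the paragraph above and an invented 0.4 margin:

```python
# Tiny sketch: auto-correct only when the best candidate beats the runner-up
# by a clear margin, otherwise ask. Costs 0.5 / 0.75 are the hypothetical
# numbers from above; the 0.4 margin is an invented threshold.

def decide(candidates, margin=0.4):
    """candidates: list of (word, weighted_cost) pairs."""
    ranked = sorted(candidates, key=lambda wc: wc[1])
    (best_word, best_cost), (second_word, second_cost) = ranked[0], ranked[1]
    if second_cost - best_cost >= margin:
        return 'auto-correct to ' + best_word
    return 'ask: did you mean {} or {}?'.format(best_word, second_word)

print(decide([('Thursday', 0.5), ('Tuesday', 0.75)]))
# -> ask: did you mean Thursday or Tuesday?   (gap 0.25 < margin 0.4)
```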