我应该使用什么数据结构来查找相似的字符串?
我应该使用什么数据结构来查找相似的字符串?例如,当您在 Google 中查询字符串“hapyp brithdya”时,Google 会询问您的意思是“生日快乐”,该字符串与之前拼写错误的字符串“hapyp brithdya”非常相似。
什么数据结构在空间和时间上执行此类操作最有效?
请帮忙。非常感谢您的宝贵时间。
What data structure should I use to find similar strings? For example, when you query Google for string "hapyp brithdya", Google asks you do you mean "happy birthday", a string which is very similar to the previously misspelled string "hapyp brithdya".
What data structure will be most efficient to do this kind of operation in both space and time ?
Please help. Your time is greatly appreciated.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
既然您要求数据结构,我将推荐 Levenshtein 自动机。
这些可以扩展到返回最有可能(根据语料库统计)的字符串校正的概率变体。请参阅 Google 的 Peter Norvig 撰写的文章“如何编写拼写校正器”,了解基本思想;将其与编辑自动机相结合需要一些有限状态传感器的知识。请参阅哈桑、诺曼和哈桑< /a> 了解更多详细信息。
Since you ask for a data structure, I'm going to recommend Levenshtein automata.
These can be extended to a probabilistic variant that returns the most likely (according to corpus statistics) correction of a string. See the essay "How to Write a Spelling Corrector" by Google's Peter Norvig for the basic idea; combining this with Levenshtein automata requires some knowledge of finite-state transducers. See Hassan, Noeman and Hassan for more details.
谷歌使用的一种学习机制是搜索历史。例如,我搜索“hapyp brithdya”,然后意识到拼写不正确,因此没有选择任何链接。我的下一个搜索将是“生日快乐”的正确拼写。从这一系列搜索中,谷歌可以发现“hapyp brithdya”实际上意味着“生日快乐”。
另一种基于相同行的评分机制可以帮助谷歌提供更可接受的拼写更正,即搜索“hapyp brithdya”导致用户点击包含“生日快乐”的链接(由谷歌搜索建议)。与用户未访问的链接中出现的“nappybirthday”相比,这增加了“happybirthday”与“hapypbrithdya”的接近度
A learning mechanism that Google uses is the search history. For example, I searched "hapyp brithdya" and then realized that the spellings were incorrect and therefore did not select any link. My next search will be for "happy birthday" the correct spellings. And from this sequence of searches google can figure out that the "hapyp brithdya" actually meant "happy birthday".
Another scoring mechanism based on the same lines that helps google to give more acceptable spelling corrections is that a search for "hapyp brithdya" leading to a click by the user on a link(suggested by google search) containing "happy birthday". This increases the proximity of "happy birthday" to "hapyp brithdya", as compared to (say) "nappy birthday" which was present in a link that the user did not visit