String matching algorithm
I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
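For reference, here is a minimal pure-Python sketch of the usual dynamic-programming (Wagner-Fischer) version; the function name is my own:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous_row[j] holds the distance between a[:i-1] and b[:j]
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current_row = [i]
        for j, cb in enumerate(b, start=1):
            current_row.append(min(
                current_row[j - 1] + 1,           # insertion
                previous_row[j] + 1,              # deletion
                previous_row[j - 1] + (ca != cb)  # substitution (free on match)
            ))
        previous_row = current_row
    return previous_row[-1]
```

With the question's data, `levenshtein("appel", "apple")` is 2, so a small distance threshold would map the typo back to "apple".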
Levenshtein distance will do this.
Note: this is a distance, so you have to compute it against every string in your database, which can be a big problem if you have a lot of entries.
If you run into this problem, record all the typos users make (typo = no direct match) and build a correction database offline that contains all the typo->fix mappings. Some companies do this even more cleverly, e.g. Google watches how users correct their own typos and learns the mappings from that.
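The brute-force "score every entry" step that the note above warns about can be sketched with the standard library; here `difflib.SequenceMatcher` stands in for a Levenshtein score, and the function and variable names are my own:

```python
import difflib

def rank_matches(query, names, limit=3):
    """Score the query against every name in the database.
    This is O(N) similarity computations per lookup, which is
    exactly the scaling problem mentioned above."""
    scored = ((difflib.SequenceMatcher(None, query.lower(), n.lower()).ratio(), n)
              for n in names)
    return [name for score, name in sorted(scored, reverse=True)[:limit]]

businesses = ["best buy", "mcdonalds", "sony", "apple"]
```

For example, `rank_matches("Mc'donalds", businesses)` puts "mcdonalds" first despite the apostrophe and capitalization.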
Soundex or Metaphone might work.
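As a sketch, here is a basic American Soundex, assuming the common variant of the h/w rule (details differ slightly between implementations; third-party packages such as jellyfish ship tested Soundex and Metaphone functions):

```python
def soundex(name: str) -> str:
    """Encode a name as its 4-character American Soundex code:
    first letter kept, remaining consonants mapped to digits,
    adjacent duplicate codes collapsed, padded/truncated to 4."""
    mapping = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]:
        for ch in letters:
            mapping[ch] = digit
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    result = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        code = mapping.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]
```

This catches same-sounding spellings, e.g. "Robert" and "Rupert" both encode to R163, but it will not help with truncated queries like "app".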
I think what you are looking for belongs to the huge field of Data Quality and Data Cleansing. I doubt you will find a ready-made python implementation for this, since it amounts to cleansing a considerable amount of data in the db, which could be of business value.
Levenshtein distance goes in the right direction but only half the way. There are several tricks to get it to handle the half matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you only relax one of the conditions, you get autocompletion with spell checking. I am not sure if there is a python implementation available, but if you want to implement it yourself it should not be more than 10-20 LOC.
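The relaxed computation described above can indeed be sketched in roughly that many lines; the names are my own, and the two relaxations are the all-zero first row (free start in the text) and the minimum over the last row (free end):

```python
def fuzzy_substring_distance(pattern: str, text: str) -> int:
    """Edit distance between pattern and the best-matching
    substring of text (relaxed start and end conditions)."""
    previous_row = [0] * (len(text) + 1)  # relaxed start: match may begin anywhere
    for i, pc in enumerate(pattern, start=1):
        current_row = [i]
        for j, tc in enumerate(text, start=1):
            current_row.append(min(
                current_row[j - 1] + 1,
                previous_row[j] + 1,
                previous_row[j - 1] + (pc != tc),
            ))
        previous_row = current_row
    return min(previous_row)  # relaxed end: match may stop anywhere
```

This handles the question's partial-match examples: "app", "ple" and "appel" all score 0 or 1 against "apple", and "bst b" scores 1 against "best buy".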
The other idea would be to use a Trie for speed-up, which can do DTW/Levenshtein on multiple results simultaneously (a huge speedup if your database is large). There is a paper on Levenshtein on Tries at IEEE, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so you get partial matches. However, since you step down in the trie, you just need to check when you have fully consumed the input and then return all leaves.
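A sketch of the trie traversal, carrying one Levenshtein row per node so that shared prefixes are scored only once (class and function names are my own; the partial-match relaxation described above is left out to keep it short):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a stored name

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def trie_search(root, query, max_dist):
    """Return all stored (word, distance) pairs within max_dist edits."""
    first_row = list(range(len(query) + 1))
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return results

def _walk(node, ch, query, prev_row, max_dist, results):
    # compute the next Levenshtein row for the character on this edge
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        row.append(min(row[j - 1] + 1,
                       prev_row[j] + 1,
                       prev_row[j - 1] + (query[j - 1] != ch)))
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:  # prune: no word below here can get closer
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_dist, results)

businesses = ["best buy", "mcdonalds", "sony", "apple"]
root = TrieNode()
for name in businesses:
    trie_insert(root, name)
```

The pruning check means whole subtrees are skipped as soon as every cell in a row exceeds the threshold, which is where the speedup over scoring each name separately comes from.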
Check this one: http://docs.python.org/library/difflib.html
It should help you.
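With the question's data, `difflib.get_close_matches` (default cutoff 0.6) already handles several of the examples out of the box:

```python
import difflib

businesses = ["best buy", "mcdonalds", "sony", "apple"]

difflib.get_close_matches("app", businesses)    # → ['apple']
difflib.get_close_matches("appel", businesses)  # → ['apple']
difflib.get_close_matches("bst b", businesses)  # → ['best buy']
```

Note that the matching is case-sensitive, so for queries like "Mc'donalds" you would want to lowercase both sides first.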