Comparing string distance based on precomputed hashes
I have a large list (over 200,000) of strings that I'd like to compare to a given string.
The given string is inserted by a user, so it may be slightly incorrect.
What I was hoping to do was create some kind of precomputed hash for each string when adding it to the list. This hash would contain information such as the string length, the sum of all its characters, etc.
My question is, does something like this already exist? Surely there would be something that lets me avoid running Levenshtein distance on every string in the list?
Or maybe there's a third option I haven't thought of yet?
Comments (1)
Sounds like you want to use a fuzzy hash of some sort. There are lots of hash functions available that can do things like this. The classic old "SOUNDEX" algorithm might even work.
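To make the idea concrete, here is a minimal sketch of the classic American Soundex code used as a precomputed bucket key. The helper names `soundex` and `build_index` are my own, not from any library, and the sample names are made up for illustration.

```python
from collections import defaultdict

# Consonant groups of the classic American Soundex scheme.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of `word` ("" if it has no ASCII letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return result.ljust(4, "0")


def build_index(strings):
    """Hypothetical helper: group strings by Soundex code, computed once at insert time."""
    index = defaultdict(list)
    for s in strings:
        index[soundex(s)].append(s)
    return index


names = ["Robert", "Rupert", "Rubin", "Ashcraft"]
index = build_index(names)
candidates = index.get(soundex("Robbert"), [])  # only this bucket needs a distance check
```

Soundex was designed for English surnames and always keeps the first letter, so a typo in the first character lands in a different bucket. Whether that is acceptable depends on the data.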
Another thought: if you estimate that the probability of an incorrect entry is low, you might be fine with a tiered lookup. A direct hit would succeed 99.9% of the time; the remaining 0.1% would fall back to SOUNDEX, which might catch 90% of those cases; and only the last 0.01% of queries would need a search of the whole list.
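A rough sketch of that tiered lookup follows. Everything here is an assumption for illustration: `TieredIndex`, `crude_key`, and the `max_distance` threshold are made-up names, and `crude_key` is only a simple stand-in for SOUNDEX so that the block stays short. Any fuzzy key function can be passed in its place.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance using the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def crude_key(s):
    """Stand-in fuzzy key (not SOUNDEX): first character, then later
    non-vowel characters with adjacent repeats collapsed."""
    chars = [c for c in s.lower() if c.isalnum()]
    if not chars:
        return ""
    out = [chars[0]]
    for c in chars[1:]:
        if c in "aeiou" or c == out[-1]:
            continue
        out.append(c)
    return "".join(out)


class TieredIndex:
    def __init__(self, strings, key_fn=crude_key):
        self.strings = list(strings)
        self.exact = set(self.strings)
        self.key_fn = key_fn
        self.buckets = defaultdict(list)
        for s in self.strings:  # keys are computed once, at insert time
            self.buckets[key_fn(s)].append(s)

    def lookup(self, query, max_distance=2):
        """Return (best_match, tier); best_match is None for an empty index."""
        # Tier 1: exact hit, a single set lookup.
        if query in self.exact:
            return query, "exact"
        # Tier 2: compare only against strings sharing the fuzzy key.
        bucket = self.buckets.get(self.key_fn(query), [])
        if bucket:
            best = min(bucket, key=lambda s: levenshtein(query, s))
            if levenshtein(query, best) <= max_distance:
                return best, "bucket"
        # Tier 3: full scan, expected to be rare.
        best = min(self.strings, key=lambda s: levenshtein(query, s), default=None)
        return best, "scan"


idx = TieredIndex(["levenshtein", "soundex", "hashing", "distance"])
r_exact = idx.lookup("soundex")
r_bucket = idx.lookup("soundexx")
r_scan = idx.lookup("zoundex")
```

The point of the design is that the expensive full scan only runs when both cheaper tiers miss, so its cost is amortised over the rare badly mistyped queries.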
Also worth checking this discussion:
How to find best fuzzy match for a string in a large string database