局部敏感哈希实现?
C/C++/Java/C# 中是否有相对简单易懂(且易于实现)的局部敏感哈希示例?
我想了解更多关于这个概念的信息,所以想尝试在几个文本文件上实现,只是为了看看它是如何工作的,所以我不需要任何高性能或任何东西......只是一个哈希的例子为相似的输入返回相似的哈希值的函数。之后我可以通过例子学到更多。 :)
Are there any relatively simple to understand (and simple to implement) locality-sensitive hash examples in C/C++/Java/C#?
I'd like to learn more about the concept and so want to try an implementation on a few text files just to see how it works, so I don't need anything high-performance or anything... just an example of a hash function that returns similar hashes for similar inputs. I can learn more from it by example afterwards. :)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
对于字符串,您可以使用近似匹配算法。
如果字符串与参考字符串等距,那么它们很可能彼此相似。这样你就有了一个字符串的局部敏感哈希实现。
您可以为一定距离范围创建不同的哈希桶。
编辑:您可以尝试其他变化的字符串距离。一个更简单的算法只会返回“否”。两个字符串之间的公共字符。
For strings you can use approximate matching algorithm.
If the strings are equidistant from a reference string then chances are that they are similar to each other. And there you go you have a locality senitive hash implementation for strings.
You can create different hash buckets for a range of distances.
EDIT: You can try other variations of string distance. A simpler algorithm would just return no. of common characters between two strings.
MSDN 博客上有一篇精彩的文章: http://blogs.msdn.com/b/spt/archive/2008/06/11/locality-sensitive-hashing-lsh-and-min-hash.aspx
另外,至少还有一个 C++ 库,您可以在此处检查源代码:http://sourceforge。净/项目/lshkit/
Well there is an excellent into article at MSDN blogs here: http://blogs.msdn.com/b/spt/archive/2008/06/11/locality-sensitive-hashing-lsh-and-min-hash.aspx
Also there is at least once C++ library which you can inspect the source code of here: http://sourceforge.net/projects/lshkit/
Hadoop 上还有一个 Java 实现。它在文档方面做得很好。
它的名称为LikeLike
There is also a Java Implementation on Hadoop. it does a good job on documents.
it's called LikeLike
我知道您明确要求 C/C++/C#,但有 nilsimsa 哈希的“nofollow">Python 端口,可能是比其他更大的图书馆更容易理解。
I realise you explicitly asked for C/C++/C#, but there is a Python port of the nilsimsa hash which might be easier to grok than other, larger libraries.