I already have an algorithm that produces locality-sensitive hashes, but how should I bucket them to take advantage of their key property (i.e. similar elements have hashes that are close in Hamming distance)?
In the MATLAB code I found, the authors simply compute a distance matrix between the hashes of the query points and the hashes of the points in the database to keep the code simple, while referring to a so-called Charikar method for an actually good implementation of the search.
I tried to look that up, but I'm not sure how to apply any of the methods I found (such as the multi-probe method) to my case. None of these techniques seems easy to plug in if you already have the hashes. Is there any simple example code for this? Or any suggestions?
This is the link to the page with the MATLAB code I'm talking about:
http://www.eecs.berkeley.edu/~kulis/klsh/klsh.htm
Comments (1)
Based on "Search in locality sensitive hashing", I would say the following, after reading "Similarity Estimation Techniques from Rounding Algorithms":
This question is somewhat broad, so I am just going to give a minimal (abstract) example here:
We have 6 (= n) vectors in our dataset, with d bits each. Let's assume that we do 2 (= N) random permutations.
Let the first random permutation begin! Remember that we permute the bits, not the order of the vectors. After permuting the bits, the vectors maintain an order (they can be sorted by their permuted bit strings).
Now the query vector, q, arrives, but it is unlikely to be identical to a vector in our dataset (after the permutation), so a binary search will not find an exact match. However, the search will end up between two vectors, for example with q lying between v0 and v3.
Now we move a pointer either up or down from that position, looking for the vector vi that matches q in the most bits. Let's say it was v0.
Similarly, we do the second permutation and find its best vector vi, let's say v4. We now compare v0 from the first permutation with v4 to see which one is closest to q, i.e. which one has the most bits equal to q.
However, if you are looking for a ready-made implementation, you should ask on Software Recommendations. I would also look at the paper I linked to, to see if the author(s) made the code public, or whether they would be willing to share it if you contact them.
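To make the permutation idea above more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not the KLSH MATLAB code and not a reference implementation from the paper: the names `PermutationIndex`, `permute`, `hamming`, `window` and `seed` are made up for this example, hashes are assumed to be equal-length tuples of 0/1 bits, and the "move the pointer up or down" step is approximated by checking a fixed number of neighbours around the binary-search position.

```python
import random
from bisect import bisect_left


def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def permute(bits, perm):
    """Reorder the bits of one vector according to the permutation perm."""
    return tuple(bits[i] for i in perm)


class PermutationIndex:
    """Hypothetical helper: one sorted table per random bit permutation."""

    def __init__(self, vectors, num_perms=2, seed=0):
        rng = random.Random(seed)
        d = len(vectors[0])
        self.vectors = vectors
        self.tables = []
        for _ in range(num_perms):
            perm = list(range(d))
            rng.shuffle(perm)  # permute bit positions, not the vectors
            # Sort the permuted bit strings lexicographically, keeping
            # the index of the original vector next to each one.
            table = sorted((permute(v, perm), i) for i, v in enumerate(vectors))
            self.tables.append((perm, table))

    def query(self, q, window=2):
        """Return (index, hamming_distance) of the best candidate found."""
        best_idx, best_dist = None, None
        for perm, table in self.tables:
            pq = permute(q, perm)
            # Binary search for the position where the query would be inserted.
            pos = bisect_left(table, (pq, -1))
            # Look at a few neighbours on both sides of that position.
            lo, hi = max(0, pos - window), min(len(table), pos + window)
            for _, i in table[lo:hi]:
                dist = hamming(self.vectors[i], q)
                if best_dist is None or dist < best_dist:
                    best_idx, best_dist = i, dist
        return best_idx, best_dist


# Six example hashes (n = 6) with d = 8 bits each; the values are arbitrary.
vectors = [
    (0, 0, 0, 0, 1, 1, 1, 1),
    (1, 0, 1, 0, 1, 0, 1, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (0, 1, 1, 0, 1, 0, 0, 1),
    (1, 1, 0, 0, 1, 1, 0, 0),
    (0, 0, 1, 1, 0, 0, 1, 1),
]
index = PermutationIndex(vectors, num_perms=2)
print(index.query((0, 1, 1, 0, 1, 0, 0, 0)))
```

The result is approximate: a larger `window` or more permutations increases the chance of finding the true nearest hash, at the cost of more comparisons. If you already have the hashes, this kind of index only needs them as input, so nothing about how they were produced has to change.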