在数据库中存储和索引二进制字符串
此处定义的二进制字符串是固定大小的位“数组”。我称它们为字符串,因为它们没有顺序(将它们排序/索引为数字没有意义),每个位都独立于其他位。每个这样的字符串都是 N 位长,N 为数百位。
我需要存储这些字符串,并使用汉明距离作为距离度量为最近邻居提供一个新的二进制字符串查询。
有专门的数据结构(度量树)用于基于度量的搜索(VP 树、覆盖树、M 树),但我需要使用常规数据库(在我的例子中为 MongoDB)。
是否有一些索引函数可以应用于二进制字符串,可以帮助数据库在执行一对一汉明距离匹配之前仅访问记录的子集? 或者,如何在标准数据库上实现这种基于汉明的搜索?
A binary string as defined here is fixed size "array" of bits. I call them strings since there is no order on them (sorting/indexing them as numbers has no meaning), each bit is independent of the others. Each such string is N bits long, with N in the hundreds.
I need to store these strings and given a new binary string query for the nearest neighbor using the Hamming distance as the distance metric.
There are specialized data-structures (metric-trees) for metric-based search (VP-trees, cover-trees, M-trees), but I need to use a regular database (MongoDB in my case).
Is there some indexing function that can be applied to the binary strings that can help the DB access only a subset of the records before performing the one-to-one Hamming distance match?
Alternatively, how would it be possible to implement such Hamming based search on a standard DB?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
汉明距离是一个度量,因此它满足三角不等式。对于数据库中的每个位串,您可以将其汉明距离存储到某个预定义的常量位串。然后就可以利用三角不等式来过滤掉数据库中的比特串。
因此,
对于每个
B
,您将存储它对应的distB
。hamming(B,S)
的下限将为abs(distB-distS)
。上限为distB+distS
。您可以丢弃所有
B
,使下限高于最低上限。我不能 100% 确定选择
C
的最佳方式。我认为您希望它是一个接近位串度量空间“中心”的位串。The hamming distance is a metric so it satisfies the triangle inequality. For each bitstring in your database, you could store the it's hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.
So let's say
So for each
B
, you would store it's correspondingdistB
.A lower bound for
hamming(B,S)
would then beabs(distB-distS)
. And the upper bound would bedistB+distS
.You can discard all
B
such that the lower bound is higher than the lowest upper bound.I'm not 100% sure as to the optimal way to choose which
C
. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.您可以使用类似于 levenshtein 自动机 的方法,应用到位串:
这相当于对数据库和可能的匹配列表进行合并联接,而无需生成每个可能的匹配。它会减少搜索空间,但仍然可能需要大量查询。
You could use an approach similar to levenshtein automata, applied to bitstrings:
This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.