Storing and indexing binary strings in a database

Posted on 2024-11-18 17:55:39

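A binary string, as defined here, is a fixed-size "array" of bits. I call them strings because they have no order (sorting or indexing them as numbers is meaningless), and each bit is independent of the others. Each such string is N bits long, with N in the hundreds.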

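I need to store these strings and, given a new binary string as a query, find its nearest neighbor using the Hamming distance as the distance metric.
There are specialized data structures (metric trees) for metric-based search (VP-trees, cover trees, M-trees), but I need to use a regular database (MongoDB in my case).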

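Is there some indexing function that can be applied to the binary strings and would help the database access only a subset of the records before performing one-to-one Hamming distance matching? Alternatively, how could such Hamming-based search be implemented on a standard database?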

三生路 2024-11-25 17:55:39

The Hamming distance is a metric, so it satisfies the triangle inequality. For each bitstring in your database, you could store its Hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

So let's say

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

So for each B, you would store its corresponding distB.
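
As a minimal sketch of this precomputation step, assuming each bitstring is held as a Python integer, the stored distB could be computed as follows. The names `hamming_dist`, `C`, and the record layout are illustrative only and do not come from any library or from the question.

```python
def hamming_dist(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Hypothetical pivot C and database bitstrings (8 bits here for readability;
# the real strings would be hundreds of bits long).
C = 0b00001111
bitstrings = [0b00000000, 0b00001111, 0b11110000, 0b10101010]

# Each record keeps the bitstring together with its precomputed distance to C.
records = [{"bits": b, "distB": hamming_dist(b, C)} for b in bitstrings]
```

In a database such as MongoDB, distB would be an ordinary integer field, so it could carry a regular index.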

By the triangle inequality, a lower bound for hamming(B,S) is abs(distB-distS), and an upper bound is distB+distS.

You can discard every B whose lower bound is higher than the lowest upper bound over all candidates, since such a B cannot be the nearest neighbor.
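
The filtering rule can be sketched in plain Python over an in-memory list, under the assumption that bitstrings are integers and that each record is a (B, distB) pair. The function `nearest_neighbor` is a hypothetical helper. Tightening the bound with exact distances as they are computed is an addition of this sketch and is not part of the rule as stated.

```python
def hamming_dist(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def nearest_neighbor(S, C, records):
    """records: iterable of (B, distB) pairs with distB = hamming_dist(B, C).

    Returns (best B, its distance to S, number of full comparisons made).
    """
    records = list(records)
    if not records:
        return None, None, 0
    dist_s = hamming_dist(S, C)
    # Lowest upper bound over all candidates: the true nearest distance
    # cannot exceed it.
    best_upper = min(dist_b + dist_s for _, dist_b in records)
    best, best_dist, checked = None, None, 0
    for b, dist_b in records:
        if abs(dist_b - dist_s) > best_upper:
            continue  # lower bound is too large, B cannot be the nearest
        checked += 1
        d = hamming_dist(b, S)
        if best_dist is None or d < best_dist:
            best, best_dist = b, d
            # Assumption of this sketch: an exact distance is also a valid
            # upper bound on the nearest distance, so use it to tighten.
            best_upper = min(best_upper, d)
    return best, best_dist, checked

# Illustrative data only.
C = 0b00001111
S = 0b00001110
data = [0b00000000, 0b00001111, 0b11110000, 0b10101010]
records = [(b, hamming_dist(b, C)) for b in data]
result = nearest_neighbor(S, C, records)
```

With an indexed distB field, the skip test corresponds to a range condition distS - u <= distB <= distS + u, where u is the current lowest upper bound, so the database only needs to fetch records inside that range.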

I'm not 100% sure what the optimal way to choose C is. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.
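
One possible reading of "center" is the bitwise majority vote of the stored strings, which minimizes the total Hamming distance to them. Whether that makes a good C for filtering is not established here; the following is only a sketch of that reading, and `majority_bitstring` is a hypothetical helper.

```python
def majority_bitstring(bitstrings, n_bits):
    """Bit i of the result is 1 if more than half of the strings have bit i set."""
    center = 0
    for i in range(n_bits):
        ones = sum((b >> i) & 1 for b in bitstrings)
        if 2 * ones > len(bitstrings):
            center |= 1 << i
    return center

# Illustrative 4-bit data only.
sample = [0b1100, 0b1010, 0b1000]
C = majority_bitstring(sample, 4)
```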
