在数据库中存储和索引二进制字符串

发布于 2024-11-18 17:55:39 字数 308 浏览 7 评论 0原文

此处定义的二进制字符串是固定大小的位“数组”。我称它们为字符串,因为它们没有顺序(将它们排序/索引为数字没有意义),每个位都独立于其他位。每个这样的字符串都是 N 位长,N 为数百位。

我需要存储这些字符串,并使用汉明距离作为距离度量为最近邻居提供一个新的二进制字符串查询。
有专门的数据结构(度量树)用于基于度量的搜索(VP 树、覆盖树、M 树),但我需要使用常规数据库(在我的例子中为 MongoDB)。

是否有一些索引函数可以应用于二进制字符串,可以帮助数据库在执行一对一汉明距离匹配之前仅访问记录的子集? 或者,如何在标准数据库上实现这种基于汉明的搜索?

A binary string as defined here is fixed size "array" of bits. I call them strings since there is no order on them (sorting/indexing them as numbers has no meaning), each bit is independent of the others. Each such string is N bits long, with N in the hundreds.

I need to store these strings and given a new binary string query for the nearest neighbor using the Hamming distance as the distance metric.
There are specialized data-structures (metric-trees) for metric-based search (VP-trees, cover-trees, M-trees), but I need to use a regular database (MongoDB in my case).

Is there some indexing function that can be applied to the binary strings that can help the DB access only a subset of the records before performing the one-to-one Hamming distance match?
Alternatively, how would it be possible to implement such Hamming based search on a standard DB?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

三生路 2024-11-25 17:55:39

汉明距离是一个度量,因此它满足三角不等式。对于数据库中的每个位串,您可以将其汉明距离存储到某个预定义的常量位串。然后就可以利用三角不等式来过滤掉数据库中的比特串。

因此,

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

对于每个 B,您将存储它对应的 distB

hamming(B,S) 的下限将为 abs(distB-distS)。上限为 distB+distS

您可以丢弃所有 B,使下限高于最低上限。

我不能 100% 确定选择 C 的最佳方式。我认为您希望它是一个接近位串度量空间“中心”的位串。

The hamming distance is a metric so it satisfies the triangle inequality. For each bitstring in your database, you could store the it's hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

So let's say

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

So for each B, you would store it's corresponding distB.

A lower bound for hamming(B,S) would then be abs(distB-distS). And the upper bound would be distB+distS.

You can discard all B such that the lower bound is higher than the lowest upper bound.

I'm not 100% sure as to the optimal way to choose which C. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.

恋你朝朝暮暮 2024-11-25 17:55:39

您可以使用类似于 levenshtein 自动机 的方法,应用到位串:

  1. 计算第一个(按字典顺序最小)满足汉明距离标准的位串。
  2. 从数据库中获取第一个大于或等于该值的结果,
  3. 如果该值匹配,则输出它并获取下一个结果。否则,计算下一个大于该值的值,然后转到 2。

这相当于对数据库和可能的匹配列表进行合并联接,而无需生成每个可能的匹配。它会减少搜索空间,但仍然可能需要大量查询。

You could use an approach similar to levenshtein automata, applied to bitstrings:

  1. Compute the first (lexicographically smallest) bitstring that would meet your hamming-distance criteria.
  2. Fetch the first result from the database that's greater than or equal to that value
  3. If the value is a match, output it and fetch the next result. Otherwise, compute the next value greater than it that is a match, and goto 2.

This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文