快速计算具有最小汉明距离的对

发布于 2024-11-29 16:19:31 字数 688 浏览 1 评论 0原文

问题

假设您有 N (~100k-1m) 个整数/位串,每个 K(例如 256)位长。该算法应返回具有最低成对汉明距离的 k 对。

示例

N = 4
K = 8
i1 = 00010011
i2 = 01010101
i3 = 11000000
i4 = 11000011


HammingDistance(i1,i2) = 3
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 3
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 4
HammingDistance(i3,i4) = 2

对于 k=1,它应该返回对列表 {(i3,i4)}。对于 k=3,它应该返回 {(i1,i2), (i1,i4), (i3,i4)}。等等。

算法

朴素的实现计算所有成对距离,对这些对进行排序并返回具有最小距离的 k:O(N^2)。有没有更好的数据结构或算法?它看起来像来自 高效查找二进制字符串的想法由于没有单个查询整数,因此无法使用大集合中的低汉明距离

Problem

Suppose you have N (~100k-1m) integers/bitstrings each K (e.g. 256) bits long. The algorithm should return the k pairs with the lowest pairwise Hamming distance.

Example

N = 4
K = 8
i1 = 00010011
i2 = 01010101
i3 = 11000000
i4 = 11000011


HammingDistance(i1,i2) = 3
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 3
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 4
HammingDistance(i3,i4) = 2

For k=1 it should return the pairlist {(i3,i4)}. For k=3 it should return {(i1,i2), (i1,i4), (i3,i4)}. And so on.

Algorithm

The naive implementation computes all pairwise distances, sorts the pairs and returns the k with the lowest distance: O(N^2). Are there any better data structures or algorithms? It looks like the ideas from Efficiently find binary strings with low Hamming distance in large set can not be used since there is no single query integer.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

阿楠 2024-12-06 16:19:31

最近的论文“The Closest Pair Problem under the Hamming Metric”仅涉及涉及n^2 因子(除非 K 非常大)。即使只找到一对也是如此。因此,除非您对实例的结构做出进一步的假设,否则似乎很难改进这一点。例如,如果假设汉明距离不是很大,则可以采样几列,在假设这些列完全匹配的情况下将字符串哈希到桶中,然后在每个桶中分别进行两两比较。对另一组随机列重复此操作,以最大限度地减少错过某些对的可能性。

The recent paper "The Closest Pair Problem under the Hamming Metric" has only algorithms involving an n^2 factor (unless K is very large). That is even for finding only a single pair. So it seems that it is hard to improve this unless you make further assumptions about the structure of your instances. For example, if you assume the Hamming distance is not very large, you could sample a few columns, hash the strings into buckets according to these under the assumptions that these columns match exactly, and then do pairwise comparison in each bucket separately. Repeat this for another set of random columns to minimize the probability you miss some pairs.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文