快速计算具有最小汉明距离的对
问题
假设您有 N (~100k-1m) 个整数/位串,每个 K(例如 256)位长。该算法应返回具有最低成对汉明距离的 k 对。
示例
N = 4
K = 8
i1 = 00010011
i2 = 01010101
i3 = 11000000
i4 = 11000011
HammingDistance(i1,i2) = 3
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 3
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 4
HammingDistance(i3,i4) = 2
对于 k=1,它应该返回对列表 {(i3,i4)}。对于 k=3,它应该返回 {(i1,i2), (i1,i4), (i3,i4)}。等等。
算法
朴素的实现计算所有成对距离,对这些对进行排序并返回具有最小距离的 k:O(N^2)。有没有更好的数据结构或算法?它看起来像来自 高效查找二进制字符串的想法由于没有单个查询整数,因此无法使用大集合中的低汉明距离。
Problem
Suppose you have N (~100k-1m) integers/bitstrings each K (e.g. 256) bits long. The algorithm should return the k pairs with the lowest pairwise Hamming distance.
Example
N = 4
K = 8
i1 = 00010011
i2 = 01010101
i3 = 11000000
i4 = 11000011
HammingDistance(i1,i2) = 3
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 3
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 4
HammingDistance(i3,i4) = 2
For k=1 it should return the pairlist {(i3,i4)}. For k=3 it should return {(i1,i2), (i1,i4), (i3,i4)}. And so on.
Algorithm
The naive implementation computes all pairwise distances, sorts the pairs and returns the k with the lowest distance: O(N^2). Are there any better data structures or algorithms? It looks like the ideas from Efficiently find binary strings with low Hamming distance in large set can not be used since there is no single query integer.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
最近的论文“The Closest Pair Problem under the Hamming Metric”仅涉及涉及n^2 因子(除非 K 非常大)。即使只找到一对也是如此。因此,除非您对实例的结构做出进一步的假设,否则似乎很难改进这一点。例如,如果假设汉明距离不是很大,则可以采样几列,在假设这些列完全匹配的情况下将字符串哈希到桶中,然后在每个桶中分别进行两两比较。对另一组随机列重复此操作,以最大限度地减少错过某些对的可能性。
The recent paper "The Closest Pair Problem under the Hamming Metric" has only algorithms involving an n^2 factor (unless K is very large). That is even for finding only a single pair. So it seems that it is hard to improve this unless you make further assumptions about the structure of your instances. For example, if you assume the Hamming distance is not very large, you could sample a few columns, hash the strings into buckets according to these under the assumptions that these columns match exactly, and then do pairwise comparison in each bucket separately. Repeat this for another set of random columns to minimize the probability you miss some pairs.