Efficiently find binary strings with low Hamming distance in a large set

Problem:

Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.

The actual data structure to hold the list is open; performance requirements dictate an in-memory solution. The cost to build the data structure is secondary; low cost to query the data structure is critical.

Example:

For a maximum Hamming Distance of 1 (values typically will be quite small)

And input: 
00001000100000000000000001111101

The values:
01001000100000000000000001111101 
00001000100000000010000001111101 

should match because there is only 1 position in which the bits are different.

11001000100000000010000001111101

should not match because 3 bit positions are different.

My thoughts so far:

For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.

If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.

How can I efficiently (without scanning the entire list) discover list members with a Hamming distance > 1?
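
For reference, a minimal sketch of the sorted-list idea for distances 0 and 1 (illustrative names, not existing code):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Comparison function for qsort/bsearch over uint32_t keys. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Print every value within Hamming distance 1 of `query` that is present
   in `list`, which must already be sorted with cmp_u32. */
static void query_within_distance_1(const uint32_t *list, size_t n, uint32_t query)
{
    /* Distance 0: the query itself. */
    if (bsearch(&query, list, n, sizeof *list, cmp_u32))
        printf("match: %08x\n", query);

    /* Distance 1: flip each of the 32 bits in turn and search again. */
    for (int bit = 0; bit < 32; bit++) {
        uint32_t candidate = query ^ (UINT32_C(1) << bit);
        if (bsearch(&candidate, list, n, sizeof *list, cmp_u32))
            printf("match: %08x\n", candidate);
    }
}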

静若繁花 2024-11-23 00:11:01

Question: What do we know about the Hamming distance d(x,y)?

Answer:

  1. It is non-negative: d(x,y) ≥ 0
  2. It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
  3. It is symmetric: d(x,y) = d(y,x)
  4. It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)

Question: Why do we care?

Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.

You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.

Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:

static inline int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x^y);
}

If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.

Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
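
For reference, one widespread mask/shift/add formulation (the classic SWAR popcount) looks like the sketch below; whether it beats the builtin depends on compiler flags and CPU, so treat it as something to benchmark rather than a drop-in win:

#include <stdint.h>

/* Classic mask/shift/add ("SWAR") population count for a 32-bit word. */
static inline int popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* sums of bit pairs  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* sums per nibble    */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* sums per byte      */
    return (int)((x * 0x01010101u) >> 24);            /* add the four bytes */
}

static inline int distance32(uint32_t x, uint32_t y)
{
    return popcount32(x ^ y);
}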

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
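
To make the BK-tree structure concrete, here is a minimal sketch (not the benchmarked implementation; names like bk_insert/bk_query are illustrative). Children are indexed by their distance to the node's key, and the triangle inequality restricts a query with limit k to the shells [dist - k, dist + k]:

#include <stdint.h>
#include <stdlib.h>

typedef struct bk_node {
    uint32_t key;
    struct bk_node *child[33];   /* child[d] holds keys at distance d, 0 < d <= 32 */
} bk_node;

static int hamming(uint32_t a, uint32_t b) { return __builtin_popcount(a ^ b); }

static bk_node *bk_new(uint32_t key)
{
    bk_node *n = calloc(1, sizeof *n);
    n->key = key;
    return n;
}

static void bk_insert(bk_node *root, uint32_t key)
{
    for (;;) {
        int d = hamming(root->key, key);
        if (d == 0) return;                               /* duplicate: ignore */
        if (!root->child[d]) { root->child[d] = bk_new(key); return; }
        root = root->child[d];
    }
}

/* Report all keys within distance `limit` of `query`. Only shells whose
   distance lies in [dist - limit, dist + limit] can contain matches, which
   is what prunes the search. */
static void bk_query(const bk_node *node, uint32_t query, int limit,
                     void (*emit)(uint32_t))
{
    int dist = hamming(node->key, query);
    if (dist <= limit) emit(node->key);
    int lo = dist - limit < 1 ? 1 : dist - limit;
    int hi = dist + limit > 32 ? 32 : dist + limit;
    for (int d = lo; d <= hi; d++)
        if (node->child[d])
            bk_query(node->child[d], query, limit, emit);
}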

The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

  • Database: 100M pseudorandom integers
  • Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
  • Results: Average # of query hits (very approximate)
  • Speed: Number of queries per second
  • Coverage: Average percentage of database examined per query
                -- BK Tree --   -- VP Tree --   -- Linear --
Dist    Results Speed   Cov     Speed   Cov     Speed   Cov
1          0.90 3800     0.048% 4200     0.048%
2         11     300     0.68%   330     0.65%
3        130      56     3.8%     63     3.4%
4        970      18    12%       22    10%
5       5700       8.5  26%       10    22%
6       2.6e4      5.2  42%        6.0  37%
7       1.1e5      3.7  60%        4.1  54%
8       3.5e5      3.0  74%        3.2  70%
9       1.0e6      2.6  85%        2.7  82%
10      2.5e6      2.3  91%        2.4  90%
any                                             2.2     100%

In your comment, you mentioned:

I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.

A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.

在你怀里撒娇 2024-11-23 00:11:01

I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.

For example for maximum distance 5, this is 242825 numbers (sum over d = 0..5 of (32 choose d)). For comparison, Dietrich Epp's VP-tree solution for example goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
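
A minimal sketch of this bitset-plus-enumeration idea (not the answerer's actual code, which is linked below) might look like the following, assuming a flat 2^32-bit table (512 MiB) and a caller-supplied emit callback:

#include <stdint.h>
#include <stdlib.h>

static uint64_t *bitset;   /* 2^32 bits = 2^26 64-bit words = 512 MiB */

static void bitset_build(const uint32_t *list, size_t n)
{
    bitset = calloc((size_t)1 << 26, sizeof *bitset);
    for (size_t i = 0; i < n; i++)
        bitset[list[i] >> 6] |= UINT64_C(1) << (list[i] & 63);
}

static int bitset_test(uint32_t v)
{
    return (bitset[v >> 6] >> (v & 63)) & 1;
}

/* Generate every value obtainable from `value` by flipping at most `budget`
   of the bits at positions >= `start`, and report those present in the
   bitset. Starting each level at `start` makes every candidate unique. */
static void enumerate(uint32_t value, int start, int budget, void (*emit)(uint32_t))
{
    if (bitset_test(value))
        emit(value);
    if (budget == 0)
        return;
    for (int bit = start; bit < 32; bit++)
        enumerate(value ^ (UINT32_C(1) << bit), bit + 1, budget - 1, emit);
}

/* A query is enumerate(query, 0, max_distance, emit), which visits
   sum over d = 0..max_distance of (32 choose d) candidates. */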

I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:

Dist     BK Tree     VP Tree         Bitset   Linear

   1   10,133.83   15,773.69   1,905,202.76   4.73
   2      677.78    1,006.95     218,624.08   4.70
   3      113.14      173.15      27,022.32   4.76
   4       34.06       54.13       4,239.28   4.75
   5       15.21       23.81         932.18   4.79
   6        8.96       13.23         236.09   4.78
   7        6.52        8.37          69.18   4.77
   8        5.11        6.15          23.76   4.68
   9        4.39        4.83           9.01   4.47
  10        3.69        3.94           2.82   4.13

Prepare     4.1s       21.0s          1.52s  0.13s
times (for building the data structure before the queries)

For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 2^32 numbers). But for distance 9 it still easily leads.

I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc queries, until at least 10 seconds have passed in total). That's fairly stable, I even get similar numbers for just 1 second.

My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there at least for now, not sure what to do about that, but the tree.c contains all the code and my test.bat shows how I compiled and ran (I used the flags from Dietrich's Makefile)). Shortcut to my solution.

One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).

牵你手 2024-11-23 00:11:01

A common approach (at least common to me) is to divide your bit string into several chunks and query on these chunks for an exact match as a pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here), with each chunk permuted to the front, and then sort the files. You can use a binary search, and you can even expand your search above and below a matching chunk for bonus.

You can then perform a bitwise Hamming distance computation on the returned results, which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.

So to recap: say you have a bunch of 32-bit strings in a DB or files and you want to find every hash that is within a 3-bit Hamming distance or less of your "query" bit string:

1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4. Or, if you use files, create four files, each being a permutation of the slices having one "islice" at the front of each "row".

2. Slice your query bit string the same way into qslice 1 to 4.

3. Query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. Because at most 3 bits differ across the four 8-bit slices, at least one slice must match exactly, so this is guaranteed to return every string within 3 bits (4 slices - 1) of the query string (plus other candidates that happen to share a slice). If using a file, do a binary search in each of the four permuted files for the same results.

4. For each returned bit string, compute the exact Hamming distance pair-wise with your query bit string (reconstructing the index-side bit strings from the four slices, either from the DB or from a permuted file); see the sketch below.

The number of operations in step 4 should be much less than a full pair-wise Hamming computation of your whole table and is very efficient in practice.
Furthermore, it is easy to shard the files into smaller files as needed for more speed using parallelism.
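
A minimal in-memory sketch of the same scheme, assuming 4 byte-wide chunks of 32-bit keys (illustrative names; a real query would probe all four tables and de-duplicate the emitted values):

#include <stdint.h>
#include <stdlib.h>

/* Rotate v left by 8*chunk bits so that byte `chunk` (0 = high byte)
   becomes the leading byte. */
static uint32_t rotate_chunk_front(uint32_t v, int chunk)
{
    int s = 8 * chunk;
    return s ? (v << s) | (v >> (32 - s)) : v;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Build four copies of the data set, each rotated so a different chunk
   leads, each sorted. The caller owns the returned arrays. */
static uint32_t **build_tables(const uint32_t *list, size_t n)
{
    uint32_t **tables = malloc(4 * sizeof *tables);
    for (int c = 0; c < 4; c++) {
        tables[c] = malloc(n * sizeof **tables);
        for (size_t i = 0; i < n; i++)
            tables[c][i] = rotate_chunk_front(list[i], c);
        qsort(tables[c], n, sizeof **tables, cmp_u32);
    }
    return tables;
}

/* Probe table `c`: binary-search the run whose leading byte equals the
   query's chunk `c`, then verify each candidate with an exact distance. */
static void probe(const uint32_t *tab, size_t n, uint32_t query, int c,
                  int maxdist, void (*emit)(uint32_t))
{
    uint32_t q = rotate_chunk_front(query, c);
    uint32_t prefix = q & 0xFF000000u;
    size_t lo = 0, hi = n;
    while (lo < hi) {                       /* lower bound of the run */
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid] < prefix) lo = mid + 1; else hi = mid;
    }
    for (size_t i = lo; i < n && (tab[i] & 0xFF000000u) == prefix; i++) {
        uint32_t original = rotate_chunk_front(tab[i], (4 - c) & 3);  /* undo rotation */
        if (__builtin_popcount(original ^ query) <= maxdist)
            emit(original);
    }
}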

Now of course in your case, you are looking for a self-join of sorts, that is, all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk and compute the Hamming distance for the resulting cluster.

If running in memory instead of files, your 100M 32-bit strings data set would be in the range of 4 GB. Hence the four permuted lists may need about 16 GB+ of RAM, though I get excellent results with memory-mapped files instead, and much less RAM, for similar-size datasets.

There are open source implementations available. The best in the space is IMHO the one done for Simhash by Moz, in C++, but designed for 64-bit strings and not 32-bit.

This bounded Hamming distance approach was first described AFAIK by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:

    1. APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

    [...]

    Given bit vectors consisting of d bits each, we choose
    N = O(n^(1/(1+ε))) random permutations of the bits. For each
    random permutation σ, we maintain a sorted order O_σ of
    the bit vectors, in lexicographic order of the bits permuted
    by σ. Given a query bit vector q, we find the approximate
    nearest neighbor by doing the following:

    For each permutation σ, we perform a binary search on O_σ to locate the
    two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of
    the sorted orders O_σ, examining elements above and below
    the position returned by the binary search in order of the
    length of the longest prefix that matches q.

Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":

    3.3 The Results for Algorithm C

    We partitioned the bit string of each page into 12 non-
    overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one
    piece in common. This approach is guaranteed to find all
    pairs of pages with difference up to 11, i.e., C-similarity 373,
    but might miss some for larger differences.

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

    1. THE HAMMING DISTANCE PROBLEM

    Definition: Given a collection of f-bit fingerprints and a
    query fingerprint F, identify whether an existing fingerprint
    differs from F in at most k bits. (In the batch-mode version
    of the above problem, we have a set of query fingerprints
    instead of a single query fingerprint.)

    [...]

    Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits
    in the table. A listing of these d-bit numbers amounts to
    "almost a counter" in the sense that (a) quite a few 2^d bit-
    combinations exist, and (b) very few d-bit combinations are
    duplicated. On the other hand, the least significant f − d
    bits are "almost random".

    Now choose d' such that |d' − d| is a small integer. Since
    the table is sorted, a single probe suffices to identify all fingerprints which match F in the d' most significant bit-positions.
    Since |d' − d| is small, the number of such matches is also
    expected to be small. For each matching fingerprint, we can
    easily figure out if it differs from F in at most k bit-positions
    or not (these differences would naturally be restricted to the
    f − d' least-significant bit-positions).

    The procedure described above helps us locate an existing
    fingerprint that differs from F in k bit-positions, all of which
    are restricted to be among the least significant f − d' bits of
    F. This takes care of a fair number of cases. To cover all
    the cases, it suffices to build a small number of additional
    sorted tables, as formally outlined in the next Section.

Note: I posted a similar answer to a related DB-only question

叫思念不要吵 2024-11-23 00:11:01

You could pre-compute every possible variation of your original list within the specified Hamming distance, and store it in a Bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."

For YES, store a list of all the original values associated with each position in the Bloom filter, and go through them one at a time. Optimize the size of your Bloom filter for speed / memory trade-offs.

Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.
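
A minimal sketch of just the Bloom-filter part (hypothetical sizes and hash mixer; the per-position lists for the "YES" path are omitted):

#include <stdint.h>
#include <stdlib.h>

#define FILTER_BITS (UINT64_C(1) << 33)   /* tune for the speed/memory trade-off */

static uint64_t *filter;

static void bloom_init(void)
{
    filter = calloc(FILTER_BITS / 64, sizeof *filter);
}

/* A simple 64-bit mixer used to derive several hash functions from a seed. */
static uint64_t mix(uint32_t v, uint32_t seed)
{
    uint64_t h = v + (uint64_t)seed * 0x9E3779B97F4A7C15ull;
    h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
    return h;
}

/* During preprocessing, call this for every value within the chosen
   Hamming distance of every list member. */
static void bloom_add(uint32_t v)
{
    for (uint32_t k = 0; k < 4; k++) {
        uint64_t bit = mix(v, k) & (FILTER_BITS - 1);
        filter[bit >> 6] |= UINT64_C(1) << (bit & 63);
    }
}

/* Returns 0 for a definite "no list member within the distance",
   1 for "maybe" (then fall back to checking the stored originals). */
static int bloom_maybe(uint32_t query)
{
    for (uint32_t k = 0; k < 4; k++) {
        uint64_t bit = mix(query, k) & (FILTER_BITS - 1);
        if (!((filter[bit >> 6] >> (bit & 63)) & 1))
            return 0;
    }
    return 1;
}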

私野 2024-11-23 00:11:01

How about sorting the list and then doing a binary search in that sorted list on the different possible values within your Hamming distance?

心安伴我暖 2024-11-23 00:11:01

One possible approach to solve this problem is using a disjoint-set data structure. The idea is to merge list members with Hamming distance <= k into the same set. Here is the outline of the algorithm:

• For each list member, calculate every possible value with Hamming distance <= k. For k=1, there are 32 values (for 32-bit values). For k=2, 32 + 32*31/2 values.

  • For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.

  • If the value is in the original input, do a "union" operation with the list member.

  • Keep the number of union operations executed in a variable.

You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute a union operation, you decrease the number of disjoint sets by 1. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time; a minimal union-find sketch follows below.
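
A minimal sketch of the union-find part for k = 1 (illustrative; `value_to_index` stands for the 2^32-entry array or hash map mentioned above, holding each value's list index or -1 if absent):

#include <stdint.h>
#include <stdlib.h>

static int32_t *parent;

static int32_t find(int32_t i)
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];   /* path halving */
        i = parent[i];
    }
    return i;
}

static void unite(int32_t a, int32_t b)
{
    a = find(a); b = find(b);
    if (a != b) parent[a] = b;
}

/* Merge every pair of list members at Hamming distance exactly 1. */
static void group_distance_1(const uint32_t *list, size_t n,
                             const int32_t *value_to_index)
{
    parent = malloc(n * sizeof *parent);
    for (size_t i = 0; i < n; i++) parent[i] = (int32_t)i;

    for (size_t i = 0; i < n; i++) {
        for (int bit = 0; bit < 32; bit++) {
            uint32_t neighbor = list[i] ^ (UINT32_C(1) << bit);
            int32_t j = value_to_index[neighbor];
            if (j >= 0) unite((int32_t)i, j);   /* same set if present */
        }
    }
}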

淤浪 2024-11-23 00:11:01

Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.

To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.

Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.

Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up index i+1 (or use the end of the array for i + 1 = 2**24).

Memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is a 6.25% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 50) MB. A sketch of the budgeted search is below.
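
A minimal sketch of the budgeted search (illustrative; assumes `data` is the radix-sorted input and `bucket_start` has 2**24 + 1 entries with bucket_start[2**24] == n as a sentinel; the three recursive levels are written as nested loops):

#include <stdint.h>
#include <stddef.h>

/* Hamming distance between the low bytes of a and b. */
static int byte_distance(uint32_t a, uint32_t b)
{
    return __builtin_popcount((a ^ b) & 0xFF);
}

static void search(const uint32_t *data, const uint32_t *bucket_start,
                   uint32_t query, int budget, void (*emit)(uint32_t))
{
    for (uint32_t b0 = 0; b0 < 256; b0++) {
        int d0 = byte_distance(b0, query >> 24);
        if (d0 > budget) continue;                    /* prune whole subtree */
        for (uint32_t b1 = 0; b1 < 256; b1++) {
            int d1 = d0 + byte_distance(b1, (query >> 16) & 0xFF);
            if (d1 > budget) continue;
            for (uint32_t b2 = 0; b2 < 256; b2++) {
                int d2 = d1 + byte_distance(b2, (query >> 8) & 0xFF);
                if (d2 > budget) continue;
                uint32_t bucket = (b0 << 16) | (b1 << 8) | b2;
                /* Leaf: scan the bucket; only the low byte still differs. */
                for (uint32_t i = bucket_start[bucket]; i < bucket_start[bucket + 1]; i++)
                    if (d2 + byte_distance(data[i], query & 0xFF) <= budget)
                        emit(data[i]);
            }
        }
    }
}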

If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.

I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.
