Problem:
Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.
The actual data structure used to hold the list is open; performance requirements dictate an in-memory solution; the cost to build the data structure is secondary, and low cost to query the data structure is critical.
Example:
For a maximum Hamming Distance of 1 (values typically will be quite small)
And input:
00001000100000000000000001111101
The values:
01001000100000000000000001111101
00001000100000000010000001111101
should match because there is only 1 position in which the bits are different.
11001000100000000010000001111101
should not match because 3 bit positions are different.
My thoughts so far:
For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.
If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1?
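A minimal sketch of the distance-0/1 idea described above, assuming the list is kept as a sorted array (the helper names are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    /* Binary search (lower bound) in the sorted input list. */
    static int contains(const uint32_t *sorted, size_t n, uint32_t v) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (sorted[mid] < v) lo = mid + 1; else hi = mid;
        }
        return lo < n && sorted[lo] == v;
    }

    /* Report every list member within Hamming distance 1 of `input`. */
    static void query_dist1(const uint32_t *sorted, size_t n, uint32_t input,
                            void (*emit)(uint32_t)) {
        if (contains(sorted, n, input)) emit(input);           /* distance 0 */
        for (int b = 0; b < 32; b++) {                         /* distance 1 */
            uint32_t candidate = input ^ ((uint32_t)1 << b);
            if (contains(sorted, n, candidate)) emit(candidate);
        }
    }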
Question: What do we know about the Hamming distance d(x,y)?
Answer: It is non-negative, it is zero exactly when x = y, it is symmetric (d(x,y) = d(y,x)), and it satisfies the triangle inequality d(x,z) <= d(x,y) + d(y,z).
Question: Why do we care?
Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.
You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:
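A minimal popcount-based distance helper along these lines (assuming 32-bit words; the function name is illustrative):

    static inline int hamming_distance(unsigned x, unsigned y)
    {
        /* XOR leaves a 1 in each differing bit position; popcount counts them. */
        return __builtin_popcount(x ^ y);
    }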
If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.
Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.
In your comment, you mentioned:
I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.
A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
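For reference, a minimal BK-tree sketch for 32-bit keys under the Hamming metric (the node layout and names below are illustrative and not taken from the profiled code; per the tip above, a production version would collapse small subtrees into flat leaf arrays):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct bk_node {
        uint32_t value;
        struct bk_node *child[33];   /* child[d]: subtree of keys at distance d from value */
    } bk_node;

    static int hamming(uint32_t x, uint32_t y) {
        return __builtin_popcount(x ^ y);
    }

    static bk_node *bk_new(uint32_t v) {
        bk_node *n = calloc(1, sizeof *n);
        n->value = v;
        return n;
    }

    static void bk_insert(bk_node *root, uint32_t v) {
        for (;;) {
            int d = hamming(root->value, v);
            if (d == 0) return;                     /* already present; skip duplicates */
            if (root->child[d] == NULL) { root->child[d] = bk_new(v); return; }
            root = root->child[d];
        }
    }

    /* Report every stored value within distance k of q via the callback. */
    static void bk_query(const bk_node *node, uint32_t q, int k, void (*emit)(uint32_t)) {
        if (node == NULL) return;
        int d = hamming(node->value, q);
        if (d <= k) emit(node->value);
        int lo = d - k < 1 ? 1 : d - k;
        int hi = d + k > 32 ? 32 : d + k;
        for (int i = lo; i <= hi; i++)              /* triangle inequality prunes the rest */
            bk_query(node->child[i], q, k, emit);
    }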
I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then, for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.
For example, for maximum distance 5, this is 242825 numbers (the sum for d = 0 to 5 of (32 choose d)). For comparison, Dietrich Epp's VP-tree solution, for example, goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:
For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through all 2^32 numbers). But at distance 9 it still easily leads.
I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc. queries, until at least 10 seconds have passed in total). That's fairly stable; I even get similar numbers for just 1 second.
My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there, at least for now; not sure what to do about that, but tree.c contains all the code and my test.bat shows how I compiled and ran it (I used the flags from Dietrich's Makefile)). Shortcut to my solution.

One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end, and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).
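A minimal sketch of the bitset approach described above (the names below are illustrative, not taken from tree.c; the bitset itself costs 2^32 bits = 512 MB):

    #include <stdint.h>
    #include <stdlib.h>

    /* One bit per possible 32-bit value: 2^32 / 64 = 67,108,864 uint64_t words. */
    static uint64_t *bitset;

    static void bitset_init(const uint32_t *list, size_t n) {
        bitset = calloc((size_t)1 << 26, sizeof(uint64_t));
        for (size_t i = 0; i < n; i++)
            bitset[list[i] >> 6] |= (uint64_t)1 << (list[i] & 63);
    }

    static int bitset_test(uint32_t v) {
        return (bitset[v >> 6] >> (v & 63)) & 1;
    }

    /* Generate every value within `dist` bit flips of v, flipping only bit
       positions >= `from`, so each combination of flips is visited exactly once. */
    static void query(uint32_t v, int dist, int from, void (*emit)(uint32_t)) {
        if (bitset_test(v)) emit(v);
        if (dist == 0) return;
        for (int b = from; b < 32; b++)
            query(v ^ ((uint32_t)1 << b), dist - 1, b + 1, emit);
    }

    /* Usage: bitset_init(list, n); query(input, max_distance, 0, print_match); */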
A common approach (at least common to me) is to divide your bit string into several chunks and query on these chunks for an exact match as a pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here), with each chunk permuted to the front, and then sort the files. You can use a binary search, and you can even expand your search above and below a matching chunk for a bonus.
You can then perform a bitwise Hamming distance computation on the returned results, which should be only a small subset of your overall dataset. This can be done using data files or SQL tables.
So to recap: say you have a bunch of 32-bit strings in a DB or in files, and you want to find every hash that is within a Hamming distance of 3 bits or less of your "query" bit string:
1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4. Or, if you use files, create four files, each being a permutation of the slices with one "islice" at the front of each "row".
2. Slice your query bit string the same way into qslice 1 to 4.
3. Query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. By the pigeonhole principle, a string that differs from the query in at most 3 bits (one less than the number of chunks) must agree with it exactly on at least one 8-bit chunk, so this retrieves every such string. If using files, do a binary search in each of the four permuted files for the same result.
4. For each returned bit string, compute the exact Hamming distance pair-wise with your query bit string (reconstructing the index-side bit strings from the four slices, either from the DB or from a permuted file).
The number of operations in step 4 should be much smaller than a full pair-wise Hamming computation over your whole table, and this is very efficient in practice.
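A minimal in-memory sketch of the same prefilter, using four sorted permuted arrays rather than files or SQL tables (rotl32, build_index and query_within are illustrative names; a value can be reported once per matching chunk, so deduplicate if that matters):

    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t rotl32(uint32_t v, int r) {
        r &= 31;
        return r ? (v << r) | (v >> (32 - r)) : v;
    }

    static int cmp_u32(const void *a, const void *b) {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    /* idx[j]: every input value rotated so its j-th 8-bit chunk is the top byte, sorted. */
    static uint32_t *idx[4];
    static size_t n_items;

    static void build_index(const uint32_t *list, size_t n) {
        n_items = n;
        for (int j = 0; j < 4; j++) {
            idx[j] = malloc(n * sizeof(uint32_t));
            for (size_t i = 0; i < n; i++)
                idx[j][i] = rotl32(list[i], 8 * j);
            qsort(idx[j], n, sizeof(uint32_t), cmp_u32);
        }
    }

    /* Report all values within Hamming distance k of q (complete recall for k <= 3). */
    static void query_within(uint32_t q, int k, void (*emit)(uint32_t)) {
        for (int j = 0; j < 4; j++) {
            uint32_t qr = rotl32(q, 8 * j);
            uint32_t lo = qr & 0xFF000000u;          /* candidates share this top byte */
            size_t a = 0, b = n_items;
            while (a < b) {                          /* binary search: first entry >= lo */
                size_t m = a + (b - a) / 2;
                if (idx[j][m] < lo) a = m + 1; else b = m;
            }
            for (size_t i = a; i < n_items && (idx[j][i] & 0xFF000000u) == lo; i++) {
                uint32_t orig = rotl32(idx[j][i], 32 - 8 * j);   /* undo the rotation */
                if (__builtin_popcount(orig ^ q) <= k)
                    emit(orig);                      /* may repeat per matching chunk */
            }
        }
    }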
Furthermore, it is easy to shard the files into smaller files as needed for more speed using parallelism.
Now of course, in your case you are looking for a self-join of sorts, that is, all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk, and compute the Hamming distance for the resulting cluster.
If running in memory instead of from files, your data set of 100M 32-bit strings would be in the range of 4 GB. Hence the four permuted lists may need about 16 GB+ of RAM, though I get excellent results with memory-mapped files instead, and much less RAM, for similar-size datasets.
There are open source implementations available. The best in this space is IMHO the one done for Simhash by Moz, written in C++ but designed for 64-bit strings rather than 32-bit.
This bounded Hamming distance approach was first described, AFAIK, by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:
Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":
This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:
Note: I posted a similar answer to a related DB-only question
You could pre-compute every possible variation of your original list within the specified Hamming distance, and store it in a Bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."
For YES, store a list of all the original values associated with each position in the bloom filter, and go through them one at a time. Optimize the size of your bloom filter for speed / memory trade-offs.
Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.
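A minimal sketch of the Bloom-filter mechanics this would need (the sizing and hash mixing below are illustrative assumptions, not tuned numbers):

    #include <stdint.h>
    #include <stdlib.h>

    #define FILTER_BITS ((uint64_t)1 << 33)   /* 1 GiB of bits; tune for false-positive rate */

    static uint64_t *filter;

    static void bloom_init(void) { filter = calloc(FILTER_BITS / 64, sizeof(uint64_t)); }

    static uint32_t mix(uint32_t x) {          /* murmur3-style 32-bit finalizer */
        x ^= x >> 16; x *= 0x85ebca6bu;
        x ^= x >> 13; x *= 0xc2b2ae35u;
        x ^= x >> 16;
        return x;
    }

    static void set_bit(uint64_t i) { filter[i >> 6] |= (uint64_t)1 << (i & 63); }
    static int  get_bit(uint64_t i) { return (filter[i >> 6] >> (i & 63)) & 1; }

    static void bloom_add(uint32_t v) {
        uint64_t h1 = mix(v), h2 = mix(v ^ 0x9e3779b9u) | 1;
        for (int k = 0; k < 4; k++)            /* 4 probes via double hashing */
            set_bit((h1 + (uint64_t)k * h2) % FILTER_BITS);
    }

    static int bloom_maybe_contains(uint32_t v) {
        uint64_t h1 = mix(v), h2 = mix(v ^ 0x9e3779b9u) | 1;
        for (int k = 0; k < 4; k++)
            if (!get_bit((h1 + (uint64_t)k * h2) % FILTER_BITS)) return 0;  /* definite NO */
        return 1;                                                           /* probable YES */
    }

    /* Build: for each list member, enumerate every value within the maximum distance
       (as in the recursive bit-flip sketches above) and bloom_add() it.
       Query: bloom_maybe_contains(input) == 0 means no member is within range;
       a 1 still needs verification against the stored candidate lists. */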
How about sorting the list and then doing a binary search in that sorted list on the different possible values within your Hamming Distance?
One possible approach to solve this problem is using a disjoint-set data structure. The idea is to merge list members with Hamming distance <= k into the same set. Here is the outline of the algorithm:
For each list member, calculate every possible value with Hamming distance <= k. For k=1, there are 32 such values (for 32-bit values). For k=2, there are 32 + 32*31/2 values.
For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.
If the value is in the original input, do a "union" operation with the list member.
You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute a union operation, the number of disjoint sets decreases by 1. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time.
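A sketch of the disjoint-set structure assumed above, with path compression and union by size (names are illustrative):

    #include <stdlib.h>

    /* Disjoint-set forest over list indices 0..n-1; path compression plus union
       by size gives near-linear total running time. */
    static int *parent, *set_size;

    static void dsu_init(size_t n) {
        parent = malloc(n * sizeof(int));
        set_size = malloc(n * sizeof(int));
        for (size_t i = 0; i < n; i++) { parent[i] = (int)i; set_size[i] = 1; }
    }

    static int dsu_find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];   /* path halving */
            x = parent[x];
        }
        return x;
    }

    static void dsu_union(int a, int b) {
        a = dsu_find(a); b = dsu_find(b);
        if (a == b) return;
        if (set_size[a] < set_size[b]) { int t = a; a = b; b = t; }
        parent[b] = a;
        set_size[a] += set_size[b];
    }

    /* For each list member list[i], enumerate the values within distance k; whenever a
       generated value maps to some index j in a hash map of the input, call dsu_union(i, j). */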
Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.
To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.

Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.
Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in the range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up index i+1 (or use the end of the array for i + 1 = 2**24).

The memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is a 6.25% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 50) MB.
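A sketch of that query, written with nested loops instead of recursion, assuming the values are fully sorted (so radix order equals numeric order) and offsets[] holds the 2**24 bucket starts plus one terminating entry (names are illustrative):

    #include <stdint.h>

    /* sorted:  the 100M values in ascending order.
       offsets: 2^24 + 1 entries; offsets[p] is the first index whose top 24 bits equal p. */
    static const uint32_t *sorted;
    static const uint32_t *offsets;

    static void query(uint32_t w, int d, void (*emit)(uint32_t)) {
        for (int b0 = 0; b0 < 256; b0++) {
            int d0 = __builtin_popcount(b0 ^ (int)(w >> 24));
            if (d0 > d) continue;                          /* prune this whole subtree */
            for (int b1 = 0; b1 < 256; b1++) {
                int d1 = d0 + __builtin_popcount(b1 ^ (int)((w >> 16) & 0xFF));
                if (d1 > d) continue;
                for (int b2 = 0; b2 < 256; b2++) {
                    int d2 = d1 + __builtin_popcount(b2 ^ (int)((w >> 8) & 0xFF));
                    if (d2 > d) continue;
                    uint32_t p = ((uint32_t)b0 << 16) | ((uint32_t)b1 << 8) | (uint32_t)b2;
                    for (uint32_t i = offsets[p]; i < offsets[p + 1]; i++)
                        if (d2 + __builtin_popcount((sorted[i] ^ w) & 0xFF) <= d)
                            emit(sorted[i]);               /* leaf within the budget */
                }
            }
        }
    }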
If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.
I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.