Algorithm: huge number of very sparse bit arrays, which encoding to use

Posted on 2024-09-30 08:51:39


I've got a special need and the most important concerns are:

  • in-memory
  • very low memory footprint
  • speed

Here's my "problem": I need to store, in-memory, a huge number of very sparse bit arrays. Those bitsets are "append only" and are to be used mostly for intersections. By huge, I mean as high as 200 000 bit arrays.

The range shall be between [0...16 000 000] for each bitset.

I ran some pre-tests with "only" 10 673 bit arrays containing some actual data I've got, and got the following results:

  1% of the bit arrays (  106 bit arrays) Hamming weight: at most     1 bit  set
  5% of the bit arrays (  534 bit arrays) Hamming weight: at most     4 bits set
 10% of the bit arrays ( 1068 bit arrays) Hamming weight: at most     8 bits set
 15% of the bit arrays ( 1603 bit arrays) Hamming weight: at most    12 bits set
 20% of the bit arrays ( 2137 bit arrays) Hamming weight: at most    17 bits set
 25% of the bit arrays ( 2671 bit arrays) Hamming weight: at most    22 bits set
 30% of the bit arrays ( 3206 bit arrays) Hamming weight: at most    28 bits set
 35% of the bit arrays ( 3740 bit arrays) Hamming weight: at most    35 bits set
 40% of the bit arrays ( 4274 bit arrays) Hamming weight: at most    44 bits set
 45% of the bit arrays ( 4809 bit arrays) Hamming weight: at most    55 bits set
 50% of the bit arrays ( 5343 bit arrays) Hamming weight: at most    67 bits set
 55% of the bit arrays ( 5877 bit arrays) Hamming weight: at most    83 bits set
 60% of the bit arrays ( 6412 bit arrays) Hamming weight: at most   103 bits set
 65% of the bit arrays ( 6946 bit arrays) Hamming weight: at most   128 bits set
 70% of the bit arrays ( 7480 bit arrays) Hamming weight: at most   161 bits set
 75% of the bit arrays ( 8015 bit arrays) Hamming weight: at most   206 bits set
 80% of the bit arrays ( 8549 bit arrays) Hamming weight: at most   275 bits set
 85% of the bit arrays ( 9083 bit arrays) Hamming weight: at most   395 bits set
 90% of the bit arrays ( 9618 bit arrays) Hamming weight: at most   640 bits set
 95% of the bit arrays (10152 bit arrays) Hamming weight: at most  1453 bits set
 96% of the bit arrays (10259 bit arrays) Hamming weight: at most  1843 bits set
 97% of the bit arrays (10366 bit arrays) Hamming weight: at most  2601 bits set
 98% of the bit arrays (10473 bit arrays) Hamming weight: at most  3544 bits set
 99% of the bit arrays (10580 bit arrays) Hamming weight: at most  4992 bits set
100% of the bit arrays (10687 bit arrays) Hamming weight: at most 53153 bits set

Given the numbers involved, I obviously need to use compressed bit arrays, and that is not an issue: it shall stay easy to deal with, given that the bit arrays are "append only".

The bits that are on are somewhat grouped, but not completely. So you'll tend to have several bits on in the same area (but usually not one right after another, which makes RLE not so great for the bits that are on).

My question is what kind of compression to use?

Now I don't know if I should put my first approach here or in an answer to my own question.

Basically I imagined a "worst case" scenario using a very dumb encoding:

  • 1 bit: if on, the following 5 bits determine how many bits are needed to compute the 'skip'; if off, optimization: the following 5 bits determine how many bits are to be taken literally (that is 'on' or 'off', without skipping) [this would only be switched to when determined to be more efficient than the other representation, so when it kicks in, it shall always be an optimization (size-wise)]

  • 5 bits: how many bits are needed to store how many bits we'll skip (i.e. the width of the next field)

  • x bits: the skip itself (how many bits to skip before the next bit that is on)

Here's an example: a bit array has 3 bits set, the first bit being at 3 098 137, the second at 3 098 141 and the third at 3 098 143.

                               +-- now we won't skip
                               |
                               |     +-- 3 because we need 3 bits to store "6" (from 3 098 138 to 3 098 143)
                               |     |    +--- 3 098 141 is on
  22    3 098 137              |     3    | +- 3 098 143 is on
1 10110 1011110100011000011001 0 00011 000101 etc. 

The first bit is on, telling us we're going to skip bits.
The next 5 bits (always 5) tell how many bits are needed to encode the skip.
22 bits tell us to skip to 3 098 137.
One bit off tells us we're now not skipping bits.
The next 5 bits (always 5) tell how many bits we'll read "as is".
6 bits: off, off, off, on, off, on, meaning 3 098 141 and 3 098 143 are on.
etc.
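
To make the scheme concrete, here is a minimal C sketch of one possible reading of that encoding (the bit-writer, the names and the rule for when to switch to the literal form are my own assumptions, not the asker's actual code): flag 1 is followed by a 5-bit width and then the skip itself, flag 0 by a 5-bit count and then that many literal bits.

#include <stdint.h>
#include <stdio.h>

/* Tiny MSB-first bit writer (illustrative; fixed buffer for the sketch). */
typedef struct { uint8_t buf[1 << 20]; size_t nbits; } BitWriter;

static void put_bit(BitWriter *w, int b) {
    if (b) w->buf[w->nbits >> 3] |= (uint8_t)(1u << (7 - (w->nbits & 7)));
    w->nbits++;
}
static void put_bits(BitWriter *w, uint32_t v, int n) {
    for (int i = n - 1; i >= 0; i--) put_bit(w, (v >> i) & 1);
}
static int bit_width(uint32_t v) {          /* bits needed to store v */
    int n = 1;
    while (v >>= 1) n++;
    return n;
}

/* Encode a sorted list of set-bit positions.  Flag 1: 5-bit width + skip,
 * the bit reached after the skip being implicitly on.  Flag 0: 5-bit count +
 * that many literal bits.  The "switch to literal when the next set bit is
 * within 31 positions" rule is an arbitrary choice for this sketch. */
static void encode(BitWriter *w, const uint32_t *pos, size_t count) {
    uint32_t cur = 0;                       /* next position not yet described */
    size_t i = 0;
    while (i < count) {
        uint32_t skip = pos[i] - cur;       /* off bits before this set bit */
        put_bit(w, 1);
        int n = bit_width(skip);
        put_bits(w, (uint32_t)n, 5);
        put_bits(w, skip, n);
        cur = pos[i] + 1;
        i++;
        if (i < count && pos[i] - cur < 31) {       /* literal run of nearby bits */
            size_t j = i;
            while (j < count && pos[j] - cur < 31) j++;
            uint32_t len = pos[j - 1] - cur + 1;
            put_bit(w, 0);
            put_bits(w, len, 5);
            for (uint32_t p = cur; p < cur + len; p++) {
                int on = (i < j && pos[i] == p);
                put_bit(w, on);
                if (on) i++;
            }
            cur += len;
        }
    }
}

int main(void) {
    static BitWriter w;                     /* zero-initialized */
    uint32_t pos[] = {3098137, 3098141, 3098143};
    encode(&w, pos, 3);
    printf("%zu bits used\n", w.nbits);     /* 40 bits for the example above */
    return 0;
}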

Given the amazing sparsity of these bit arrays, this seems quite size-efficient.

So using that encoding, I took my sample data and computed a "worst case" scenario (I haven't written the algorithm yet, I'd rather get a few inputs from here first): basically I assumed that the "size optimization" would never kick in and, also, that the 5-bit field would always be set to its maximum value (24 bits), which of course cannot happen.

I did it just to have a very crude approximation of what the "worst of the worst" case could be.

I was very pleasantly surprised:

Worst case scenario: 

108 913 290 bits needed for the 10 687 very sparse bit arrays
12.9 MB (13 295 KB)
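
(Under those assumptions each set bit costs at most 1 + 5 + 24 = 30 bits, so 108 913 290 bits corresponds to roughly 108 913 290 / 30 ≈ 3.6 million set bits in the sample, i.e. a bit over 1 KB per bit array on average.)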

The data being actual data, and all of it being similar, I know that, if worst comes to worst, I could store my 200 000 bit arrays in about 240 MB, which is fine.

I'm pretty sure the actual encoding will come to way less than that, but as I haven't actually written it yet, I can only (very easily) compute the "worst case", which is why that's the only figure I show.

Any hints / ideas as to how to make this more size-efficient (remembering these are super-sparse bit arrays, that there shall be hundreds of thousands of them, that they must be in memory, and that they shall be "append only")?

About my 'append-only' case

Basically I've got one growing "expanse" (the range, but "expanse" is the actual term as I understand it) and a lot of bit arrays that have a few bits set. When the range goes from, say, 0 to 1 000 000, all the bit arrays go from 0 to 1 000 000 too. When the range grows to 1 000 001, then all the bit arrays grow too, all by one bit. But most of these bit arrays will have a '0' appended at their end, while about 4 to 8 of the bit arrays will have a '1' appended at their end. However, I cannot predict in advance which of the bit arrays will have a 0 or a 1 appended.

So I've got a lot of bit arrays that all have the same size, that are all very sparse (< 0.5% of their bits set) and that all "grow" as the range grows (so they're always growing at the same rate).
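
Given that sparsity and the strictly increasing append order, one very plain baseline is to keep each bit array as a growing sorted list of set-bit positions; here is a minimal C sketch (the struct and function names are illustrative assumptions, not an actual design) of the append and of a sorted-merge intersection:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct {
    uint32_t *pos;       /* sorted positions of the bits that are on */
    size_t count, cap;
} SparseBits;

/* Called when the shared expanse grows and this array gets a '1' appended. */
static void append_set_bit(SparseBits *a, uint32_t position) {
    if (a->count == a->cap) {
        a->cap = a->cap ? a->cap * 2 : 8;
        a->pos = realloc(a->pos, a->cap * sizeof *a->pos);
    }
    a->pos[a->count++] = position;   /* positions only ever arrive in increasing order */
}

/* Intersection is a linear merge of two sorted position lists. */
static size_t intersect(const SparseBits *a, const SparseBits *b, uint32_t *out) {
    size_t i = 0, j = 0, n = 0;
    while (i < a->count && j < b->count) {
        if (a->pos[i] < b->pos[j]) i++;
        else if (a->pos[i] > b->pos[j]) j++;
        else { out[n++] = a->pos[i]; i++; j++; }
    }
    return n;
}

int main(void) {
    SparseBits a = {0}, b = {0};
    append_set_bit(&a, 3098137); append_set_bit(&a, 3098141);
    append_set_bit(&b, 3098141); append_set_bit(&b, 3098143);
    uint32_t common[2];
    size_t n = intersect(&a, &b, common);
    printf("%zu common bit(s), first at %u\n", n, (unsigned)(n ? common[0] : 0));
    free(a.pos); free(b.pos);
    return 0;
}

At 32 bits per set bit this is less compact than the gap encoding above, but it is hard to beat for simplicity, and the merge-style intersection only touches the set bits.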


Judy arrays are great. But I read about them a few years ago and that stuff was "above my head". Judy arrays are a C-only 20KLOC lib and I'm definitely not re-implementing that. But they're amazing.

So I guess I need to add that I'd like all this to stay relatively simple, which is not that far-fetched given the special "append only" property of my very sparse bit arrays.


6 Answers

茶底世界 2024-10-07 08:51:39


You didn't say what programming language you want to use. It sounds like you don't want Judy because it's "C-only"... if you are using C# then you could use my Compact Patricia Trie instead. It is about 4500 LOC (commented) and uses similar ideas to Judy, but the size and speed of each trie are not ideal due to limitations of .NET. It is not optimized for computing intersections either, but such an algorithm could be added. The article about CP Tries does not emphasize this point, but it can store sets (sparse bit arrays) much more compactly than dictionaries (the graphs in the article show the size and speed of dictionaries, not sets).

The best case is a dense cluster of bits. With 50% occupancy (every other bit set), it requires less than 8 bits per key (less than 4 bits per integer). (correction: less than 8 bits, not more.)

If you only need an approximate representation of the data, use a Bloom filter.

By the way, what do you mean by "append only"? Does it mean that you only add keys, or that each key you add is greater than the keys you added before?

Update: Since you are only adding larger keys, you should probably design a special algorithm just for your case. IMO, when designing a custom algorithm, you should make it as simple as possible. So here's my idea, which assumes the keys of different bitsets are uncorrelated (therefore there is no benefit of attempting to compress data between different bitsets):

A bitset is represented by a sorted array of 32-bit slots. Because it's sorted, you can use binary search to find keys. Each slot consists of a 24-bit "prefix" and 8 bits of "flags". Each slot represents a region of 8 keys. The "flags" tell you which of the 8 keys in the region are present in the bitset, and the "prefix" tells you which region we're talking about, by specifying bits 3 to 26 of the key. For example, if the following bits are "1" in the bitset:

1, 3, 4, 1094, 8001, 8002, 8007, 8009

...then the bitset is represented by an array of 4 slots (16 bytes):

Prefix:     0,  136, 1000, 1001
 Flags:  0x1A, 0x40, 0x86, 0x02

The first slot represents 1, 3, 4 (notice that bits 1, 3 and 4 are set in the number 0x1A); the second slot represents 1094 (136 * 8 + 6); the third slot represents 8001, 8002, and 8007; the fourth slot represents 8009. Does this make sense?

I don't know if this is as compact as your idea. But I think you'll get faster queries and faster modifications, and it will be fairly easy to implement.
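
For illustration, here is a rough C sketch of that slot layout, with the prefix and flags packed into one uint32_t (the names and the append-only shortcut in add_key are assumptions on my part, not part of the answer):

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct {
    uint32_t *slot;       /* (prefix << 8) | flags, sorted by prefix */
    size_t count, cap;
} SlotSet;

static void add_key(SlotSet *s, uint32_t key) {        /* keys arrive in increasing order */
    uint32_t prefix = key >> 3, bit = 1u << (key & 7);
    if (s->count && (s->slot[s->count - 1] >> 8) == prefix) {
        s->slot[s->count - 1] |= bit;                   /* same region: just set the flag */
        return;
    }
    if (s->count == s->cap) {
        s->cap = s->cap ? s->cap * 2 : 8;
        s->slot = realloc(s->slot, s->cap * sizeof *s->slot);
    }
    s->slot[s->count++] = (prefix << 8) | bit;
}

static int contains(const SlotSet *s, uint32_t key) {  /* binary search on the prefix */
    uint32_t prefix = key >> 3;
    size_t lo = 0, hi = s->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if ((s->slot[mid] >> 8) < prefix) lo = mid + 1; else hi = mid;
    }
    return lo < s->count && (s->slot[lo] >> 8) == prefix
        && (s->slot[lo] & (1u << (key & 7)));
}

int main(void) {
    uint32_t keys[] = {1, 3, 4, 1094, 8001, 8002, 8007, 8009};
    SlotSet s = {0};
    for (size_t i = 0; i < 8; i++) add_key(&s, keys[i]);
    for (size_t i = 0; i < s.count; i++)
        printf("prefix %u flags 0x%02X\n",
               (unsigned)(s.slot[i] >> 8), (unsigned)(s.slot[i] & 0xFF));
    printf("contains 8007: %d, contains 8008: %d\n",
           contains(&s, 8007), contains(&s, 8008));
    free(s.slot);
    return 0;
}

An intersection of two such sets would be a sorted merge on the prefixes, AND-ing the flags of matching slots.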

丢了幸福的猪 2024-10-07 08:51:39


You may use a binary tree for the bit array.
Say you have an array with the range [M..N].
Store it in this manner:

Choose some number encoding for [0...RAM size], like Fibonacci, Golomb or Rice codes (you may choose the most suitable representation after profiling your program with actual data).

  1. If the array is empty (no bits set), store it as the number 0.
  2. If the array is full (all bits set), store it as the number 1.
  3. Else split it into two parts: A in [M..(M+N)/2-1] and B in [(M+N)/2..N].
  4. Generate the representations P0 and P1 of A and B using this algorithm recursively.
  5. Get the length of P0 (in bits, or whatever other unit the length is a whole number of) and store it as a number (you may need to add 1 if the length can be 1, e.g. if you store 0 as the single bit 0).
  6. Store P0 then P1.

In this case, if the bounds are the same, the intersection and union operations are trivial recursions:

Intersection:

  1. If array A is empty, store 0.
  2. If array A is full, store a copy of B.
  3. Else split both arrays, intersect both halves, store the length of the first half, then both halves.

This algorithm can work with bits (if you need them to be most compact) or with bytes/words (if bit operations are too slow).

Also, you may add special encodings for arrays with a single bit set, or for all arrays smaller than some limit (8 elements, for example), to reduce the depth of recursion.

The drawback is that, without some hacks, adding/removing an element to/from the array is a complex operation (as complex as the intersection/union operations).

For example, an array with the single bit 0xAB set, in a range of 0..0xFF, would be stored as (pseudocode):

0  1      2   3  4      5  6  7      8  9  10     11 12     13    14     15     16
1, EMPTY, 13, 1, EMPTY, 9, 1, EMPTY, 5, 1, EMPTY, 1, EMPTY, FULL, EMPTY, EMPTY, EMPTY
                                                   |  AA  | AB  |
                                         |A8..A9| AA    ..   AB |
                                      | A8         ..        AB |AC..AF|
                            |A0..A7| A8             ..              AF |
                         | A0                 ..                    AF |B0..BF|
               |80..9F| A0                        ..                       BF |
            | 80                                 ..                        BF |C0..FF|
 | 0..7F| 80                                          ..                          FF |       

EMPTY and FULL are codes for empty and full arrays; the numbers are lengths in elements (they should be replaced with the actual lengths in bytes, bits or so).
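
A toy C sketch of the recursive encoding (using strings of '0'/'1' characters instead of packed bits, and Elias gamma as the number code; both are arbitrary choices for illustration, and only the encoder is shown):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Elias gamma code of n >= 1 as a string of '0'/'1' characters
 * (assumption: gamma is the chosen number code; any of the codes above would do). */
static char *gamma_code(unsigned n) {
    int width = 0;
    for (unsigned t = n; t; t >>= 1) width++;
    char *s = malloc(2 * width);            /* 2*width - 1 code chars + NUL */
    int i = 0;
    for (int k = 0; k < width - 1; k++) s[i++] = '0';
    for (int k = width - 1; k >= 0; k--) s[i++] = ((n >> k) & 1) ? '1' : '0';
    s[i] = '\0';
    return s;
}

static char *one_char(char c) { char *s = malloc(2); s[0] = c; s[1] = '\0'; return s; }

/* Recursive encoding of bits[lo..hi): "0" if empty, "1" if full,
 * otherwise gamma(len(P0)) followed by P0 then P1. */
static char *encode(const unsigned char *bits, size_t lo, size_t hi) {
    size_t ones = 0;
    for (size_t i = lo; i < hi; i++) ones += bits[i];
    if (ones == 0) return one_char('0');
    if (ones == hi - lo) return one_char('1');
    size_t mid = lo + (hi - lo) / 2;
    char *p0 = encode(bits, lo, mid);
    char *p1 = encode(bits, mid, hi);
    char *len = gamma_code((unsigned)strlen(p0));
    char *out = malloc(strlen(len) + strlen(p0) + strlen(p1) + 1);
    strcpy(out, len); strcat(out, p0); strcat(out, p1);
    free(p0); free(p1); free(len);
    return out;
}

int main(void) {
    unsigned char bits[256] = {0};
    bits[0xAB] = 1;                          /* the single-bit example from above */
    char *code = encode(bits, 0, 256);
    printf("%zu code symbols: %s\n", strlen(code), code);
    free(code);
    return 0;
}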

If you do not need a fast single-bit check, you may use the simplest approach:
just store the distances between set bits using codes: Fibonacci, Rice, Golomb, Levenshtein, Elias, etc., or invent another one.
Note that in order to get minimal code length, you should use a code whose code lengths are as close as possible to -log p / log 2, where p is the probability of that value. You may use a Huffman code for that.

For example, using the Elias gamma code, an array like this:

0 1 0000 1 1 000 1 0 1 000000000000000000 1 000000000000000000 
2     5   1   4    2         19                   18           (distance)

should be encoded as:

010 00101 1 00100 010 000010011 000010010
 2    5   1   4    2     19        18     (distance code explained)

The most compact option for an array with a uniform bit distribution would be arithmetic coding, but it is very CPU-time consuming, because you'll have to read and write such arrays bit by bit, with no fast skipping available.

吃不饱 2024-10-07 08:51:39


You may look into compressed bitmaps. A common strategy is to use word-aligned run-length encoding.

C++ implementation:

https://github.com/lemire/EWAHBoolArray

Java implementation:

https://github.com/lemire/javaewah

Reference:

Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.
http://arxiv.org/abs/0901.3751
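
For illustration, here is a simplified C sketch of the word-aligned run-length idea (in the spirit of WAH/EWAH, but not either library's actual format or API; the 31-bit grouping and the names are assumptions): each output word is either one literal 31-bit group or a fill word counting a run of identical groups.

#include <stdint.h>
#include <stdio.h>

/* Append the next 31-bit group to the compressed stream; returns the new word count. */
static size_t push_group(uint32_t *out, size_t n, uint32_t group) {
    uint32_t fill = (group == 0) ? 0x80000000u            /* run of all-zero groups */
                  : (group == 0x7FFFFFFFu) ? 0xC0000000u  /* run of all-one groups  */
                  : 0;
    if (fill && n && (out[n - 1] & 0xC0000000u) == fill
              && (out[n - 1] & 0x3FFFFFFFu) < 0x3FFFFFFFu) {
        out[n - 1]++;                                     /* extend the previous fill word */
        return n;
    }
    out[n] = fill ? (fill | 1u) : group;                  /* new fill word or literal word */
    return n + 1;
}

int main(void) {
    /* toy input: only bits 5 and 3 000 000 are set, in a range of about 3M bits */
    uint32_t positions[] = {5, 3000000};
    size_t ngroups = 3000000 / 31 + 1, npos = 2, p = 0, n = 0;
    static uint32_t out[16];
    for (size_t g = 0; g < ngroups; g++) {
        uint32_t group = 0;
        while (p < npos && positions[p] / 31 == g) {
            group |= 1u << (positions[p] % 31);
            p++;
        }
        n = push_group(out, n, group);
    }
    printf("%zu groups compressed into %zu words\n", ngroups, n);
    return 0;
}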

策马西风 2024-10-07 08:51:39


Even if they aren't exactly what you're looking for, it's worth checking out Judy trees. Judy is a heavily optimized library for ordered maps, and one configuration is specifically designed as a bitset rather than a map. I don't think intersection is one of the operations natively optimized for, though...

The general idea is to use a tree with a fixed number of address bits per level, and take advantage of the sparseness at each level. This results in quite good compression even in the worst case, and fast query performance as well. I believe an intersection operation would be relatively straightforward and potentially very fast.

At any rate, it's always a good idea to steal from the best!
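
As a toy illustration of that idea (emphatically not Judy itself, which uses many node types and compresses sparse nodes, which is where its memory savings come from; the names and the 8-bits-per-level split are assumptions), a fixed-radix trie over a 24-bit key space could look like this:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct Leaf { uint64_t bits[4]; } Leaf;           /* 256-bit bitmap */
typedef struct Mid  { Leaf *child[256]; } Mid;
typedef struct Trie { Mid  *child[256]; } Trie;

static void set_bit(Trie *t, uint32_t key) {              /* key < 2^24 */
    uint8_t a = key >> 16, b = (key >> 8) & 0xFF, c = key & 0xFF;
    if (!t->child[a]) t->child[a] = calloc(1, sizeof(Mid));
    Mid *m = t->child[a];
    if (!m->child[b]) m->child[b] = calloc(1, sizeof(Leaf));
    m->child[b]->bits[c >> 6] |= 1ull << (c & 63);
}

static int test_bit(const Trie *t, uint32_t key) {
    uint8_t a = key >> 16, b = (key >> 8) & 0xFF, c = key & 0xFF;
    if (!t->child[a] || !t->child[a]->child[b]) return 0;
    return (t->child[a]->child[b]->bits[c >> 6] >> (c & 63)) & 1;
}

int main(void) {
    Trie *t = calloc(1, sizeof(Trie));
    set_bit(t, 3098137); set_bit(t, 3098141); set_bit(t, 3098143);
    printf("%d %d\n", test_bit(t, 3098141), test_bit(t, 3098142));  /* prints 1 0 */
    return 0;
}

An intersection would walk two such tries in lockstep, descending only into children that are non-NULL in both, and AND-ing the leaf bitmaps.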

太阳公公是暖光 2024-10-07 08:51:39


Considering you are going to do a bunch of intersection tests anyway, maybe you should try storing all of the bitvectors in parallel. One sparse, 16M entry list. Each entry in that list contains a list of which of the 200k input bitvectors has a '1' at that location. It looks like you expect to have only about 5 bits set per input vector, or 1M total entries? Taking a straw-man linked list implementation for the toplevel and the buckets, and a worst case of no intersections at all (thus 1M buckets with 1 element each) you could store it all in 32MB.
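
A rough C sketch of that transposed layout (array-based rather than linked lists; the names and the fixed per-position capacity are illustrative assumptions):

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct {
    uint32_t position;       /* the bit position in the shared expanse */
    uint32_t count;          /* how many vectors have a '1' here (typically 4..8) */
    uint32_t vector_id[8];   /* their ids; a real version would grow this dynamically */
} Entry;

typedef struct { Entry *e; size_t count, cap; } Transposed;

/* Record that `vector_id` has a '1' at `position` (positions arrive in increasing order). */
static void append(Transposed *t, uint32_t position, uint32_t vector_id) {
    if (t->count == 0 || t->e[t->count - 1].position != position) {
        if (t->count == t->cap) {
            t->cap = t->cap ? t->cap * 2 : 16;
            t->e = realloc(t->e, t->cap * sizeof *t->e);
        }
        t->e[t->count].position = position;
        t->e[t->count].count = 0;
        t->count++;
    }
    Entry *last = &t->e[t->count - 1];
    if (last->count < 8) last->vector_id[last->count++] = vector_id;
}

int main(void) {
    Transposed t = {0};
    append(&t, 1000000, 17); append(&t, 1000000, 42); append(&t, 1000001, 3);
    printf("%zu occupied positions, %u vectors at position %u\n",
           t.count, (unsigned)t.e[0].count, (unsigned)t.e[0].position);
    free(t.e);
    return 0;
}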

别低头,皇冠会掉 2024-10-07 08:51:39


You might be interested in Binary Decision Diagrams (BDD), and more precisely Zero-suppressed Binary Decision Diagram (ZBDD).

They are used to represent sets in a compressed way. Unlike other compressed forms, operations (such as set intersections, or insertions of elements - your "append only" thing?) work directly on the compressed form.
