使用比普通的bitset更多的存储空间的咆哮位图
我有一个比特斯集来跟踪是否存在的物品
b = 01100110000,
它表示存在第二和第三个项目,并且不存在第一个和第四项。
在搜索可以优化此BITSET数组的库时。我遇到咆哮的bitmaps 听起来非常令人兴奋。
我对此进行了快速测试,
public static void main(String[] args) throws IOException {
RoaringBitmap roaringBitMap = new RoaringBitmap();
BitSet bitSet = new BitSet(5000);
double prob = 0.001;
Random random = new Random();
for (int i = 0; i < 5000; i++) {
if (random.nextDouble() < prob) {
bitSet.set(i);
roaringBitMap.add(i);
}
}
System.out.println(bitSet.cardinality());
System.out.println("bitset bytes: "+ bitSet.size());
System.out.println("RoaringBitmap bytes: " + roaringBitMap.getSizeInBytes() * 8);
}
基本上我们正在设置一些值并检查数据结构的整体大小。
当我们使用多个概率值运行时。我得到了
Prob Byte | Bitset字节 | roaringbitMap字节 |
---|---|---|
0.001 | 5056 | 288 |
0.01 | 5056 | 944 |
0.1 | 5056 | 7872 |
0.999 | 5056 | 65616 |
如果您看到的是我们插入越来越多的数字,那么RoaringBitmap的记忆足迹会增加。
- 这是预期的吗?
- 在最坏的情况下,它不仅应该回到基于BITSet的Impartaiton?
- 0.999不能被视为0.001的倒数,我们可以将其存储在288个字节中吗?
- 当我们进行服务互相调用并使用Jackson库时,将这些BITT表示为字符串的最佳方法是什么(但不是基于字节的序列化库)
I have a bitset which i am using to track whether an item is present or not example
b = 01100110000
it represents that 2nd and 3rd items are present and 1st and 4th item are not present.
While searching for library which can optimise this bitset array. I came across Roaring bitmaps which sounded very exciting.
I did a quick test with it,
public static void main(String[] args) throws IOException {
RoaringBitmap roaringBitMap = new RoaringBitmap();
BitSet bitSet = new BitSet(5000);
double prob = 0.001;
Random random = new Random();
for (int i = 0; i < 5000; i++) {
if (random.nextDouble() < prob) {
bitSet.set(i);
roaringBitMap.add(i);
}
}
System.out.println(bitSet.cardinality());
System.out.println("bitset bytes: "+ bitSet.size());
System.out.println("RoaringBitmap bytes: " + roaringBitMap.getSizeInBytes() * 8);
}
Basically we are setting some values and check overall size of data structure.
when we run this with multiple prob values. I got
prob byte | bitset bytes | RoaringBitmap bytes |
---|---|---|
0.001 | 5056 | 288 |
0.01 | 5056 | 944 |
0.1 | 5056 | 7872 |
0.999 | 5056 | 65616 |
If you see as we insert more and more numbers, the memory footprint of RoaringBitmap increases.
- Is this expected?
- In the worst case should it not just fall back to bitset based implementaiton?
- can't 0.999 be treated as inverse of 0.001 and we would be able to store it in 288 bytes?
- What is the most optimal way to represent these bitset as String when we are making inter service calls and using jackson library (but not byte based serialisation libraries)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
图格式具有公共规范:
https://github.com/roarringbitmap/roarringbitmap/roaringbitmap/roaringbitmap/roaringbitmap/roarringformatematem
咆哮的位 用法只是应用程序性能的一个因素。咆哮的位图寻求提供经济的存储,同时在现实世界中提供高性能。
给定[0,x)中的n个整数,那么咆哮位图的字节中的序列化大小绝不应该超过此界限:
8 + 9 *(((long)x + 65535)/65536 + 2 * n,
也就是固定的宇宙大小(x)的开销,咆哮的位图永远不会使用超过2个字节。
没有一个总是理想的数据结构的东西。您应该确保咆哮的位图适合您的应用程序配置文件。在至少两种情况下,咆哮的位图可以轻松地用优越的替代方案来替代:
在大间隔中,您的随机值很少(即,您的集合非常稀疏)。例如,以0、65536、131072、196608、262144 ...以下集合,如果这是应用程序的典型特征,则可以考虑使用哈希集或简单排序的数组。
您有一组密集的随机值,这些值永远不会形成连续值的运行。例如,考虑集合0,2,4,...,10000。如果这是您应用程序的典型特征,则最好将您与传统的bitset一起服务。
有关咆哮的相关参考
态
软件:练习和经验第46卷,第5期,第709–719页,2016年5月
The Roaring Bitmap format has a public specification:
https://github.com/RoaringBitmap/RoaringFormatSpec
The memory usage is only one factor in an application's performance. Roaring bitmaps seek to provide economical storage, while also providing high performance in real-world applications.
Given N integers in [0,x), then the serialized size in bytes of a Roaring bitmap should never exceed this bound:
8 + 9 * ((long)x+65535)/65536 + 2 * N
That is, given a fixed overhead for the universe size (x), Roaring bitmaps never use more than 2 bytes per integer.
There is no such thing as a data structure that is always ideal. You should make sure that Roaring bitmaps fit your application profile. There are at least two cases where Roaring bitmaps can be easily replaced by superior alternatives compression-wise:
You have few random values spanning in a large interval (i.e., you have a very sparse set). For example, take the set 0, 65536, 131072, 196608, 262144 ... If this is typical of your application, you might consider using a hash set or a simple sorted array.
You have dense set of random values that never form runs of continuous values. For example, consider the set 0,2,4,...,10000. If this is typical of your application, you might be better served with a conventional bitset.
Relevant references about Roaring
Better bitmap performance with Roaring bitmaps,
Software: Practice and Experience Volume 46, Issue 5, pages 709–719, May 2016
当条目数量很少时,似乎是这种情况,但是随着我们增加条目的数量,不同的情况变得不那么明显。尽管Lib作者没有证实它(我问在这里,然后跟进 nofollow noreferrer”>在这里)
似乎不像是这样的 场景。带走的是不要盲目使用库,对您的用例进行测试。
this seems to be the case when number of entries are small, But as we increase the number of entries, the different becomes less visible. Although it is not confirmed by the lib author ( i asked here and followed up here)
looking at this it seems like as number of entries grows they might be using bitmap only behind the scene. The take away is that don't blindly use the library, test for your use case.
RoaringBitMap
具有三种类型的容器,在我们的方案中,我们主要使用bitmapcontainer
。每个bitmapcontainer
可以存储65535个元素,占据8k内存。但是,该存储根据高和低16位分配。根据高16位选择相应的存储桶。如果不存在相应的存储桶(元素范围差为65535),则将存储在新的存储桶中。在数据过于稀疏的情况下,这可能会导致许多未填充的水桶,从而导致一些冗余的内存。这可能导致roaringbitmap
在某些情况下,比常规位图使用更多的内存。如果稀疏数据的范围差异为65535或元素的数量很少,则使用比位图使用的内存少。它最大的优势是它不需要根据最大偏移分配空间。RoaringBitmap
has three types of containers, and in our scenarios, we mostly useBitmapContainer
. EachBitmapContainer
can store 65535 elements, occupying 8k of memory. However, the storage is split according to the high and low 16 bits. The corresponding bucket is selected based on the high 16 bits. If the corresponding bucket does not exist (the element range difference is 65535), it will be stored in a new bucket. In the case of overly sparse data, this can lead to many buckets that are not filled up, resulting in some redundant memory. This can causeRoaringBitmap
to use more memory than a regular bitmap in some scenarios. If the range difference of sparse data is 65535 or the number of elements is small, it will use less memory than Bitmap. Its biggest advantage is that it does not need to allocate space according to the maximum offset.