使用比普通的bitset更多的存储空间的咆哮位图

发布于 2025-02-11 08:48:37 字数 1547 浏览 4 评论 0原文

我有一个比特斯集来跟踪是否存在的物品

b = 01100110000，

它表示存在第二和第三个项目，并且不存在第一个和第四项。

在搜索可以优化此BITSET数组的库时。我遇到咆哮的bitmaps 听起来非常令人兴奋。

我对此进行了快速测试，

    public static void main(String[] args) throws IOException {
        RoaringBitmap roaringBitMap = new RoaringBitmap();
        BitSet bitSet = new BitSet(5000);
        double prob = 0.001;
        Random random = new Random();
        for (int i = 0; i < 5000; i++) {
            if (random.nextDouble() < prob) {
                bitSet.set(i);
                roaringBitMap.add(i);
            }
        }
        System.out.println(bitSet.cardinality());
        System.out.println("bitset bytes: "+ bitSet.size());
        System.out.println("RoaringBitmap bytes: " + roaringBitMap.getSizeInBytes() * 8);
    }

基本上我们正在设置一些值并检查数据结构的整体大小。

当我们使用多个概率值运行时。我得到了

Prob Byte	Bitset字节	roaringbitMap字节
0.001	5056	288
0.01	5056	944
0.1	5056	7872
0.999	5056	65616

如果您看到的是我们插入越来越多的数字，那么RoaringBitmap的记忆足迹会增加。

这是预期的吗？
在最坏的情况下，它不仅应该回到基于BITSet的Impartaiton？
0.999不能被视为0.001的倒数，我们可以将其存储在288个字节中吗？
当我们进行服务互相调用并使用Jackson库时，将这些BITT表示为字符串的最佳方法是什么（但不是基于字节的序列化库）

原文

I have a bitset which i am using to track whether an item is present or not example

b = 01100110000

it represents that 2nd and 3rd items are present and 1st and 4th item are not present.

While searching for library which can optimise this bitset array. I came across Roaring bitmaps which sounded very exciting.

I did a quick test with it,

    public static void main(String[] args) throws IOException {
        RoaringBitmap roaringBitMap = new RoaringBitmap();
        BitSet bitSet = new BitSet(5000);
        double prob = 0.001;
        Random random = new Random();
        for (int i = 0; i < 5000; i++) {
            if (random.nextDouble() < prob) {
                bitSet.set(i);
                roaringBitMap.add(i);
            }
        }
        System.out.println(bitSet.cardinality());
        System.out.println("bitset bytes: "+ bitSet.size());
        System.out.println("RoaringBitmap bytes: " + roaringBitMap.getSizeInBytes() * 8);
    }

Basically we are setting some values and check overall size of data structure.

when we run this with multiple prob values. I got

prob byte	bitset bytes	RoaringBitmap bytes
0.001	5056	288
0.01	5056	944
0.1	5056	7872
0.999	5056	65616

If you see as we insert more and more numbers, the memory footprint of RoaringBitmap increases.

Is this expected?
In the worst case should it not just fall back to bitset based implementaiton?
can't 0.999 be treated as inverse of 0.001 and we would be able to store it in 288 bytes?
What is the most optimal way to represent these bitset as String when we are making inter service calls and using jackson library (but not byte based serialisation libraries)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

格子衫的從容 2025-02-18 08:48:37

图格式具有公共规范：

https://github.com/roarringbitmap/roarringbitmap/roaringbitmap/roaringbitmap/roaringbitmap/roarringformatematem

咆哮的位用法只是应用程序性能的一个因素。咆哮的位图寻求提供经济的存储，同时在现实世界中提供高性能。

给定[0，x）中的n个整数，那么咆哮位图的字节中的序列化大小绝不应该超过此界限：

8 + 9 *（（（long）x + 65535）/65536 + 2 * n，

也就是固定的宇宙大小（x）的开销，咆哮的位图永远不会使用超过2个字节。

没有一个总是理想的数据结构的东西。您应该确保咆哮的位图适合您的应用程序配置文件。在至少两种情况下，咆哮的位图可以轻松地用优越的替代方案来替代：

在大间隔中，您的随机值很少（即，您的集合非常稀疏）。例如，以0、65536、131072、196608、262144 ...以下集合，如果这是应用程序的典型特征，则可以考虑使用哈希集或简单排序的数组。

您有一组密集的随机值，这些值永远不会形成连续值的运行。例如，考虑集合0,2,4，...，10000。如果这是您应用程序的典型特征，则最好将您与传统的bitset一起服务。

有关咆哮的相关参考

丹尼尔·莱米尔（Daniel Lemire），欧文·凯瑟（Owen Kaser），内森·库尔兹（Nathan Kurz），卢卡·德里（Luca Deri），克里斯·奥哈拉（Chris O'Hara），弗朗索瓦·雅克（FrançoisSaint-jacques），格雷戈里（Gregory Saint-Jacques），格雷戈里（Gregory Ssi-yan-kai），咆哮的位图：实施优化的软件库，软件：练习和体验第48卷，第48期，第467-895页，第867-895页
Samy Chambi，Daniel Lemire，Daniel Lemire，Daniel Lemire，Daniel Lemire，，Daniel Lemire，欧文·卡瑟（Owen Kaser），罗伯特·戈丁（Robert Godin），
态
软件：练习和经验第46卷，第5期，第709–719页，2016年5月
，丹尼尔·莱米尔（Daniel Lemire），格雷戈里（Gregory Ssi-Yan-kai），欧文·卡瑟（Owen Kaser），咆哮，软件：练习和体验第46卷，第11期，第1547-1569页，2016年11月116日，
萨米·尚比（Samy Chambi），丹尼尔·伦米尔（Samy Chambi），丹尼尔·莱米尔（Daniel Lemire），罗伯特·戈丁（Robert Godin ，fangjin yang，用咆哮的位图优化druid ，创意，2016，2016，2016。

The Roaring Bitmap format has a public specification:

https://github.com/RoaringBitmap/RoaringFormatSpec

The memory usage is only one factor in an application's performance. Roaring bitmaps seek to provide economical storage, while also providing high performance in real-world applications.

Given N integers in [0,x), then the serialized size in bytes of a Roaring bitmap should never exceed this bound:

8 + 9 * ((long)x+65535)/65536 + 2 * N

That is, given a fixed overhead for the universe size (x), Roaring bitmaps never use more than 2 bytes per integer.

There is no such thing as a data structure that is always ideal. You should make sure that Roaring bitmaps fit your application profile. There are at least two cases where Roaring bitmaps can be easily replaced by superior alternatives compression-wise:

You have few random values spanning in a large interval (i.e., you have a very sparse set). For example, take the set 0, 65536, 131072, 196608, 262144 ... If this is typical of your application, you might consider using a hash set or a simple sorted array.

You have dense set of random values that never form runs of continuous values. For example, consider the set 0,2,4,...,10000. If this is typical of your application, you might be better served with a conventional bitset.

Relevant references about Roaring

Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai, Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience Volume 48, Issue 4 April 2018 Pages 867-895
Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin,
Better bitmap performance with Roaring bitmaps,
Software: Practice and Experience Volume 46, Issue 5, pages 709–719, May 2016
Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser, Consistently faster and smaller compressed bitmaps with Roaring, Software: Practice and Experience Volume 46, Issue 11, pages 1547-1569, November 2016
Samy Chambi, Daniel Lemire, Robert Godin, Kamel Boukhalfa, Charles Allen, Fangjin Yang, Optimizing Druid with Roaring bitmaps, IDEAS 2016, 2016.

回复收藏 0 原文

栩栩如生 2025-02-18 08:48:37

当条目数量很少时，似乎是这种情况，但是随着我们增加条目的数量，不同的情况变得不那么明显。尽管Lib作者没有证实它（我问在这里，然后跟进 nofollow noreferrer”>在这里）

概率	的	数量块碎块	roaringbitmap	save save save
参赛者	save	;	“	rel = “
https://github.com/roaringbitmap/roaringbitmap/pull/567 ”	href =	save	7744	84
0.1	50000	50048	65616	-31
0.999	50000	50048	65616 <- NOTE it does not increase	-31
0.001	500000	500032	8704	98
0.01	500000	500032	80720	83
0.1	500000	500032	524480	-4
0.999	500000	500032	524480 <- NOTE it does not increase	-4
0.001	50000000	50000000	835232	98
0.01	50000000	50000000 500000	8036368	83
0.1	50000000	50000000	50000000 50016240-0.03	0.999
50000000	50000000 50000000 50000000	50000000	50016240＆lt; - 注意，它	。

似乎不像是这样的场景。带走的是不要盲目使用库，对您的用例进行测试。

this seems to be the case when number of entries are small, But as we increase the number of entries, the different becomes less visible. Although it is not confirmed by the lib author ( i asked here and followed up here)

prob	number of entries	bitset bits	RoaringBitmap bits	saving %
0.001	50000	50048	928	98
0.01	50000	50048	7744	84
0.1	50000	50048	65616	-31
0.999	50000	50048	65616 <- NOTE it does not increase	-31
0.001	500000	500032	8704	98
0.01	500000	500032	80720	83
0.1	500000	500032	524480	-4
0.999	500000	500032	524480 <- NOTE it does not increase	-4
0.001	50000000	50000000	835232	98
0.01	50000000	50000000	8036368	83
0.1	50000000	50000000	50016240	-0.03
0.999	50000000	50000000	50016240 <- NOTE it does not increase	-0.03

looking at this it seems like as number of entries grows they might be using bitmap only behind the scene. The take away is that don't blindly use the library, test for your use case.

回复收藏 0 原文

ゃ人海孤独症 2025-02-18 08:48:37

RoaringBitMap具有三种类型的容器，在我们的方案中，我们主要使用bitmapcontainer。每个bitmapcontainer可以存储65535个元素，占据8k内存。但是，该存储根据高和低16位分配。根据高16位选择相应的存储桶。如果不存在相应的存储桶（元素范围差为65535），则将存储在新的存储桶中。在数据过于稀疏的情况下，这可能会导致许多未填充的水桶，从而导致一些冗余的内存。这可能导致roaringbitmap在某些情况下，比常规位图使用更多的内存。如果稀疏数据的范围差异为65535或元素的数量很少，则使用比位图使用的内存少。它最大的优势是它不需要根据最大偏移分配空间。