Best data structure for storing a large amount of one-bit data

Published 2024-11-05 13:43:12


I want to store lots of data so that

  1. it can be accessed by an index, and
  2. each datum is just yes or no (so one bit is probably enough for each).

I am looking for the data structure with the highest performance that occupies the least space.

Storing the data in flat memory at one bit per datum is probably not a good choice; on the other hand, different types of tree structures still use lots of memory (for example, the pointers in each node are required to build the tree, even though each node holds just one bit of data).

Does anyone have any ideas?


Comments (5)

今天小雨转甜 2024-11-12 13:43:12


What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory-efficient)?
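A minimal sketch of the packed option in Java, storing the bits in a `long[]` (the class and method names here are illustrative, not from any library):

```java
public class PackedBits {
    private final long[] words;

    public PackedBits(int nBits) {
        // Round up to whole 64-bit words.
        words = new long[(nBits + 63) / 64];
    }

    // Set or clear the i-th bit.
    public void set(int i, boolean value) {
        if (value) words[i / 64] |= 1L << (i % 64);
        else       words[i / 64] &= ~(1L << (i % 64));
    }

    // Read the i-th bit.
    public boolean get(int i) {
        return (words[i / 64] & (1L << (i % 64))) != 0;
    }

    public static void main(String[] args) {
        PackedBits bits = new PackedBits(1000);
        bits.set(42, true);
        System.out.println(bits.get(42)); // true
        System.out.println(bits.get(43)); // false
    }
}
```

The byte-per-value variant is just `boolean[]` (or `byte[]`) with direct indexing; the packed version trades two extra arithmetic operations per access for an eightfold (or more) reduction in memory.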

无尽的现实 2024-11-12 13:43:12


Well, in Java the BitSet might be a good choice: http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
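A brief usage sketch of `java.util.BitSet` (the indices used here are arbitrary examples):

```java
import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) {
        // BitSet packs bits into a long[] internally and grows as needed.
        BitSet flags = new BitSet(1_000_000);
        flags.set(123);    // mark index 123 as "yes"
        flags.clear(123);  // back to "no"
        flags.set(456);
        System.out.println(flags.get(456));      // true
        System.out.println(flags.get(123));      // false
        System.out.println(flags.cardinality()); // 1 (number of bits set)
    }
}
```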

软甜啾 2024-11-12 13:43:12


If I understand your question correctly, you should store them in an unsigned integer where you assign each value to one bit of the integer (a flag).

Say you represent 3 values that can each be on or off. Assign the first to 1, the second to 2, and the third to 4. Your unsigned int can then be 0, 1, 2, 3, 4, 5, 6, or 7 depending on which values are on, and you check the values using bitwise comparisons.
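A minimal sketch of this flag scheme in Java (Java has no unsigned int, but the bit operations are identical; the constant names are illustrative):

```java
public class Flags {
    // Each yes/no value gets one bit position, as the answer describes.
    static final int FIRST  = 1; // 1 << 0
    static final int SECOND = 2; // 1 << 1
    static final int THIRD  = 4; // 1 << 2

    public static void main(String[] args) {
        int state = 0;
        state |= FIRST;  // turn the first value on
        state |= THIRD;  // turn the third value on
        System.out.println((state & FIRST) != 0);  // true
        System.out.println((state & SECOND) != 0); // false
        System.out.println(state);                 // 5
    }
}
```

This works well for a small, fixed number of values; for "lots of data" as in the question, the same idea generalizes to an array of integers, as the other answers describe.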

很快妥协 2024-11-12 13:43:12


Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language needs to support overloading the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then returning the result of arr[n / n_bits] & (1 << (n % n_bits)).
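A sketch of that access formula in Java, using a byte array so n_bits = 8 (the helper name is illustrative):

```java
public class BitIndex {
    // Returns the nth bit of arr, per the formula
    // arr[n / n_bits] & (1 << (n % n_bits)) with n_bits = 8.
    static boolean get(byte[] arr, int n) {
        int nBits = 8;
        return (arr[n / nBits] & (1 << (n % nBits))) != 0;
    }

    public static void main(String[] args) {
        byte[] arr = new byte[4];
        arr[1] |= 1 << 2;                 // set bit 10 (1*8 + 2) directly
        System.out.println(get(arr, 10)); // true
        System.out.println(get(arr, 11)); // false
    }
}
```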

执妄 2024-11-12 13:43:12


Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter

It performs very well and is space-efficient. But make sure you read the fine print ;-). The following is quoted from the wiki page above.

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.

Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.

One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
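The scheme described above can be sketched in Java. This toy version derives its k hashes by passing different seeds to a single hash function (the "different initial values" trick from the quote); the class name, hash choice, and parameters are all illustrative, not tuned:

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // A simple polynomial hash; the seed plays the role of the
    // "initial value" passed to the hash function in the quote.
    private int hash(String key, int seed) {
        int h = seed;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return Math.floorMod(h, m);
    }

    // Set the k bit positions for this key to 1.
    public void add(String key) {
        for (int seed = 0; seed < k; seed++) {
            bits.set(hash(key, seed));
        }
    }

    // false means "definitely not present"; true means "probably present".
    public boolean mightContain(String key) {
        for (int seed = 0; seed < k; seed++) {
            if (!bits.get(hash(key, seed))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1024, 3);
        bf.add("apple");
        System.out.println(bf.mightContain("apple"));  // true
        System.out.println(bf.mightContain("banana")); // probably false
    }
}
```

Note that, per the quoted caveats, `mightContain` can return false positives (whose rate depends on m, k, and the number of inserted elements) and the structure supports no removal.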
