java:稀疏位向量

发布于 2024-09-05 06:22:46 字数 1537 浏览 5 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

恍梦境° 2024-09-12 06:22:47

TL;DR 转到这里 Java 中的高效稀疏 BitSet 实现

我知道这是一个“老”问题,但是有我偶然发现了这篇文章同样的问题。虽然答案很好,但我最终还是不满意。经过进一步挖掘,我想我已经找到了 Java 中稀疏 BitSet 问题的“明确”答案。

本演示中,作者 Bruce Haddon 博士,讨论了他的研究人员为创建标准 Java BitSet 的高内存效率和高性能替代品所做的努力。

他的演示文稿的原始链接已失效,但我联系了 Haddon 博士,并在此处保留了代码和演示文稿:

https: //github.com/brettwooldridge/SparseBitSet

我强烈建议您阅读此演示文稿。即使您对稀疏位集不感兴趣,这也是一本令人着迷的读物,它更多的是关于解决问题的真正本质......

幻灯片:是计算机科学、软件工程还是黑客?

TL;DR go here Efficient Sparse BitSet implementation in Java

I know this is an "old" question, but having the same question I stumbled across this post. While the answers are good, I was ultimately not satisfied. After further digging, I think I've come across the "definitive" answer to the question of sparse BitSets in Java.

In this presentation the author, Dr. Bruce Haddon, discusses the efforts of his researchers to create a highly memory-efficient and high-performance replacement for the standard Java BitSet.

The original links to his presentation are dead, but I contacted Dr. Haddon and have preserved both the code and presentation here:

https://github.com/brettwooldridge/SparseBitSet

I cannot recommend reading this presentation more highly. It is a fascinating read even if you have no interest in sparse bit sets, it is more about the true nature of problem solving...

Slides: Is it Computer Science, Software Engineering, or Hacking?

∞琼窗梦回ˉ 2024-09-12 06:22:47

如果它真的很稀疏(例如,加载量小于 1%),那么使用按位索引索引的哈希表可能会很好;您只需知道表中索引是否存在即可知道该位分别是 1 或 0。

如果密度高出几个百分点,则可以使用按位索引除以 64 进行索引的哈希表,并将包含实际位的字存储在哈希表中。如果哈希表包含 int(N/64)(V>>(N) 的值 V,则设置位 N mod 64))&1 为真。

这两个答案都假设您想要优化对位的随机访问。如果您想通过索引优化对位的顺序(或其他访问),那么您可能需要稀疏矩阵结构,根据预期密度使用相同类型的低级位向量表示。请参阅稀疏矩阵

If its really sparse (e.g., less than 1% loading), then using a hash table indexed by bit index is probably pretty good; mere presence or absence of the index in the table is all you need to know if the bit is one or zero respectively.

If the density is upwards of a few percent, you can use a hash table indexed by bit index divided by 64, and store long words in the hash table containing actual bits. Bit N is set if the hash table contains value V for int(N/64) and (V>>(N mod 64))&1 is true.

Both of these answers assume you want to optimize random access to bits. If you want to optimize sequential (or other access) to bits by index, then you may want a sparse matrix structure, using the same kind of low-level bit vector representation depending on expected density. See Sparse Matrices

枫林﹌晚霞¤ 2024-09-12 06:22:47

colt 库 具有稀疏矩阵(1D、2D 和 3D)。它还具有高效的 BitVector,每个值 1 位,而不是像 boolean[] 那样具有 8 位。

然而,稀疏矩阵不直接支持位——仅支持双精度和对象。您可以通过将位索引映射到长索引(bitIndex>>6)来包装一维稀疏双矩阵,因为每个长索引保存64位,将检索到的 double 转换为原始 long 值,并使用位操作访问检索到的 long 的位。需要做一点工作,但远不及自己实现稀疏向量。一旦您的包装器工作,您可能会避免将双精度数转换为长整型数,并使用双精度一维稀疏矩阵的可用 Colt 源代码作为起点来实现真正的稀疏长一维矩阵。

编辑:更多信息。 Colt 向量/矩阵最初不需要内存来存储,假设所有位(长整型)最初都是 0。将值设置为非零会消耗内存。将值设置回 0 会继续消耗内存,尽管零值的内存会定期回收。

如果位确实稀疏,使得每个后备长整型值仅具有一位集,则存储开销将非常低,每个实际存储位需要 64 位。但正如您提到的典型情况是 20-40% 稀疏,那么开销会低得多,如果位聚集在范围内,例如从 0-100、然后 1000-1100 和 2000-2200 的位,则可能不会浪费存储空间(值以十六进制表示。)总体而言,只有 1/16 的区域被分配给位,但集群意味着位的存储没有浪费空间。

The colt library has sparse matrices (1D, 2D and 3D). It also has an efficient BitVector, with 1 bit per value, rather than 8-bits as boolean[] does.

However, the sparse matrices do not support bits directly - only doubles and objects. You could wrap the 1D sparse double matrix by maping bit index to long indices (bitIndex>>6) since each long holds 64 bits, convert the retrieved double to a raw long value, and use bit manipulation to access the bits of the retrieved long. A little work, but nowhere near as much as implementing the sparse vector yourself. Once your wrapper is working, you might avoid converting doubles to longs, and implement a real sparse long 1d matrix using the available Colt source code for the double 1D sparse matrix as a starting point.

EDIT: More info. The Colt vectors/matrices require no memory initially for storage, assuming all bits (longs) are initially 0. Setting a value to non-zero consumes memory. Setting the value back to 0 continues to consume memory, although memory for zero values is reclaimed periodically.

If the bits are truly sparse, such that each backing long value only has one bit set, then the storage overhead will be very poor, requiring 64-bits per actual bit stored. But as you mention typical case is 20-40% sparse, then the overhead will be much lower, with possibly no wasted storage if bits are clustered in ranges, e.g. bits from 0-100, then 1000-1100, and 2000-2200 (values in hex.) Overall, only 1/16 of the region is assigned to bits, but the clustering means that the bits are stored with no wasted space.

梦里泪两行 2024-09-12 06:22:47

聚会已经很晚了,但这个问题的 PageRank 相当高。 Roaring Bitmap 已经吃掉了很多这样的用例。

Very late to the party here, but this question has pretty high PageRank. Roaring Bitmap has eaten a lot of these use cases.

爺獨霸怡葒院 2024-09-12 06:22:47

CERN COLT 广泛用于向量和矩阵计算,并且具有稀疏矩阵,但并不专门用于位向量。

http://acs.lbl.gov/软件/colt/api/cern/colt/matrix/impl/SparseObjectMatrix1D.html

CERN COLT is widely used for vector and matrix computation, and has sparse matrices, but isn't specifically used for bit vectors.

http://acs.lbl.gov/software/colt/api/cern/colt/matrix/impl/SparseObjectMatrix1D.html

橘亓 2024-09-12 06:22:47

一个哈希表,仅仅通过键的存在或不存在就可以告诉你一些事情?那将是一个哈希集!我对 BitSet 上的集合(甚至是散列集合)的性能持怀疑态度。这实际上取决于速度或内存是否是主要驱动因素。

A hash table where the mere presence or absence of the key tells you something? That would be a hash set then! I'm skeptical of the performance of a set (even a hashed one) over the BitSet. It really depends on whether speed or memory is the primary driver.

飘落散花 2024-09-12 06:22:47

您可以尝试 JavaEWAH 库。

https://code.google.com/p/javaewah/

根据您的问题,它可能会很合适。

(它被 Apache Hive 和其他人使用。)

You could try the JavaEWAH library.

https://code.google.com/p/javaewah/

Depending on your problem it may be a good fit.

(It is used by Apache Hive and others.)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文