对于我的用例来说,最有效的位向量压缩方法是什么?
我正在研究计算生物学的一个项目,我需要存储许多序列之间不同的基因座索引。目前,我正在使用 B+Tree 来实现此目的,但我想对于这样的用例,使用位图索引会更快:两个序列之间只有少量位点不同,平均为 1%,并且它们沿序列几乎均匀分布;所以看起来位图索引压缩还有很大的空间。 我的问题是,我无法设法找到一种有效的压缩方法:
- 允许快速单独位设置/取消设置
- 允许对位图进行有效范围查询
- 可能允许两个索引的快速异或/与
提前谢谢您的建议。
I'm working on a project in computational biology and I need to store an index of locuses that differ between many sequences. For now, I'm using a B+Tree for that purpose, but I guess using a bitmap index would be way faster for such a use case : only a small number of locus differ between two sequences, 1% on average, and they are nearly equally distributed along the sequence; so it seems like there is a lot of room for bitmap index compression.
My problem is that I cannot manage to find a compression method that can efficiently:
- allow fast individual bit setting/unsetting
- permit efficient range queries over the bitmap
- possibly allow fast XOR-ing/AND-ing of two indexes
Thx in advance for your suggestions.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
查看 FastBit:
https://sdm.lbl.gov/fastbit/
Check out FastBit:
https://sdm.lbl.gov/fastbit/
您可以使用这样的简单树数据结构:
每个节点表示大位数组的一个子数组,即 (2^n)*sizeof(long) 位,n>=0。如果叶节点位于树的底部,则它们在“mask”中存储原始位掩码,否则它们在“mask”中存储 0。这样,'mask'值为0的叶子节点就可以表示位数组中一个 (2^n)*sizeof(long) 大小的空白区域,从而可以高效地存储稀疏位数组。
当然,所有叶节点中的 leftChild 和 rightChild 均为 null。每个其他节点都有一个 leftChild 和 rightChild 指针,并且每个不是叶节点的节点都至少有一个带有掩码的后代节点,该掩码中设置了位。
要在给定索引处查找一点:
一旦您理解了这个想法,构建树并开发其余算法应该很容易。我还没有实际测试代码,因为这不是一个完整的解决方案,可能仍然存在一些拼写错误等。我不是位图索引专家,可能有(可能是)一个现成的包可以做得更好,但这个解决方案很简单并且应该相对有效。与普通位数组相比,1% 可能还不够稀疏,不足以使这一点更好(假设每个长整型存储 64 位,平均而言,不需要超过 2 个长整型就可以设置超过 1 位),但是如果稀疏性的增加超出了空间和时间节省所显示的范围。
You could use a simple tree data structure like this:
Every node represents a sub-array of the large bit array that is (2^n)*sizeof(long) bits, n>=0. Leaf nodes store a raw bit mask in 'mask' if they're at the bottom of the tree, otherwise they store 0 in 'mask'. This way, leaf node with a 'mask' value of 0 can represent a (2^n)*sizeof(long) -sized empty area in the bit array, so sparse bit arrays can be stored efficiently.
leftChild and rightChild are of course null in all leaf nodes. Every other node has both a leftChild and rightChild pointer, and every node that isn't a leaf node has at least one descendant node with mask that has bits set in it.
To find a bit at a given index:
Constructing the tree and developing the rest of the algorithms should be easy enough once you understand the idea. I haven't actually tested the code, since this isn't a complete solution, some typos or such might remain. And I'm not a bitmap index expert, there might be (probably is) a ready-made package that does this better, but this solution is simple and should be relatively efficient. 1% might not yet be sparse enough to make this better compared to just a plain bit array (assuming longs storing 64 bits each, it doesn't take more than 2 longs to have more than one bit set on average), but if the sparsity increases beyond that the space and time savings will show.