对于我的用例来说，最有效的位向量压缩方法是什么？

发布于 2024-10-13 05:52:31 字数 273 浏览 4 评论 0原文

我正在研究计算生物学的一个项目，我需要存储许多序列之间不同的基因座索引。目前，我正在使用 B+Tree 来实现此目的，但我想对于这样的用例，使用位图索引会更快：两个序列之间只有少量位点不同，平均为 1%，并且它们沿序列几乎均匀分布；所以看起来位图索引压缩还有很大的空间。我的问题是，我无法设法找到一种有效的压缩方法：

允许快速单独位设置/取消设置
允许对位图进行有效范围查询
可能允许两个索引的快速异或/与

提前谢谢您的建议。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

墨小沫ゞ 2024-10-20 05:52:31

查看 FastBit：

https://sdm.lbl.gov/fastbit/

回复收藏 0 原文

南汐寒笙箫 2024-10-20 05:52:31

您可以使用这样的简单树数据结构：

struct node {
    node * leftChild;
    node * rightChild;
    long mask;
};
struct tree {
    int exponent; // the size of the tree is 2^exponent
    node rootNode;
};

每个节点表示大位数组的一个子数组，即 (2^n)*sizeof(long) 位，n>=0。如果叶节点位于树的底部，则它们在“mask”中存储原始位掩码，否则它们在“mask”中存储 0。这样，'mask'值为0的叶子节点就可以表示位数组中一个 (2^n)*sizeof(long) 大小的空白区域，从而可以高效地存储稀疏位数组。

当然，所有叶节点中的 leftChild 和 rightChild 均为 null。每个其他节点都有一个 leftChild 和 rightChild 指针，并且每个不是叶节点的节点都至少有一个带有掩码的后代节点，该掩码中设置了位。

要在给定索引处查找一点：

bool find_bit_at_index(tree t, long ind) {
    long divider = 1 << (t.exponent - 1);
    node *n = &t.rootNode;
    node *lastNode;
    while (n)
    {
       lastNode = n;
       if (ind >= divider) {
          n = n->rightChild;
          ind -= divider;
       }
       else {
          n = n->leftChild;
       }
       divider >>= 1;
    }
    return lastNode->mask & (1 << ind);
}

一旦您理解了这个想法，构建树并开发其余算法应该很容易。我还没有实际测试代码，因为这不是一个完整的解决方案，可能仍然存在一些拼写错误等。我不是位图索引专家，可能有（可能是）一个现成的包可以做得更好，但这个解决方案很简单并且应该相对有效。与普通位数组相比，1% 可能还不够稀疏，不足以使这一点更好（假设每个长整型存储 64 位，平均而言，不需要超过 2 个长整型就可以设置超过 1 位），但是如果稀疏性的增加超出了空间和时间节省所显示的范围。

You could use a simple tree data structure like this:

struct node {
    node * leftChild;
    node * rightChild;
    long mask;
};
struct tree {
    int exponent; // the size of the tree is 2^exponent
    node rootNode;
};

Every node represents a sub-array of the large bit array that is (2^n)*sizeof(long) bits, n>=0. Leaf nodes store a raw bit mask in 'mask' if they're at the bottom of the tree, otherwise they store 0 in 'mask'. This way, leaf node with a 'mask' value of 0 can represent a (2^n)*sizeof(long) -sized empty area in the bit array, so sparse bit arrays can be stored efficiently.

leftChild and rightChild are of course null in all leaf nodes. Every other node has both a leftChild and rightChild pointer, and every node that isn't a leaf node has at least one descendant node with mask that has bits set in it.

To find a bit at a given index:

bool find_bit_at_index(tree t, long ind) {
    long divider = 1 << (t.exponent - 1);
    node *n = &t.rootNode;
    node *lastNode;
    while (n)
    {
       lastNode = n;
       if (ind >= divider) {
          n = n->rightChild;
          ind -= divider;
       }
       else {
          n = n->leftChild;
       }
       divider >>= 1;
    }
    return lastNode->mask & (1 << ind);
}

Constructing the tree and developing the rest of the algorithms should be easy enough once you understand the idea. I haven't actually tested the code, since this isn't a complete solution, some typos or such might remain. And I'm not a bitmap index expert, there might be (probably is) a ready-made package that does this better, but this solution is simple and should be relatively efficient. 1% might not yet be sparse enough to make this better compared to just a plain bit array (assuming longs storing 64 bits each, it doesn't take more than 2 longs to have more than one bit set on average), but if the sparsity increases beyond that the space and time savings will show.

回复收藏 0 原文

~没有更多了~