小数据大数据集多重索引:空间效率低下?

发布于 2024-11-25 16:23:37 字数 550 浏览 2 评论 0原文

我根本不是数据库设计方面的专家,所以在尝试用 CS 术语翻译之前,我会用简单的语言表达我的需求:我正在尝试找到快速迭代大型子集的正确方法(例如约 100Mo 的双)的数据,在一个可能非常大的数据集中(比如几个 Go)。 我有基本上由 4 个整数(键)和值组成的对象,一个简单的结构(1 个双精度型 1 个短型)。 由于我的键只能接受少量值(数百个),因此我认为将数据保存为树是有意义的(键的深度为 1,值是叶子,至少在我的幼稚视图中很像 XML 的 XPath) 。

我希望能够根据键值/这些键值的函数迭代叶子的子集。过滤的组合键会有所不同。我认为这就是所谓的横向搜索?
因此,为了避免比较相同键的 n 次,理想情况下,我需要通过键的每个排列对数据结构进行索引(12 种可能性: !4/!2 )。这似乎就是 boost::multi_index 的用途,但是,除非我忽略了这一点,否则这样做的方式实际上是构建这 12 个树结构,将指向我的值节点的指针存储为树叶。我想考虑到我的值与键相比的尺寸很小,这将是极其空间效率低下的。

任何有关我应该使用的设计/数据结构的建议,或有关这些主题的简明教育材料的指示,将不胜感激。

I am not at all an expert in database design, so I will put my need in plain words before I try to translate it in CS terms: I am trying to find the right way to iterate quickly over large subsets (say ~100Mo of double) of data, in a potentially very large dataset (say several Go).
I have objects that basically consist of 4 integers (keys) and the value, a simple struct (1 double 1 short).
Since my keys can take only a small number of values (couple hundreds) I thought it would make sense to save my data as a tree (1 depth by key, values are the leaves, much like XML's XPath in my naive view at least).

I want to be able to iterate through subset of leaves based on key values / a fonction of those keys values. Which key combination to filter upon will vary. I think this is call a transversal search ?
So to avoid comparing n times the same keys, ideally I would need the data structure to be indexed by each of the permutation of the keys (12 possibilities: !4/!2 ). This seems to be what boost::multi_index is for, but, unless I'm overlooking smth, the way this would be done would be actually constructing those 12 tree structure, storing pointers to my value nodes as leaves. I guess this would be extremely space inefficient considering the small size of my values compared to the keys.

Any suggestions regarding the design / data structure I should use, or pointers to concise educational materials regarding these topics would be very appreciated.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

农村范ル 2024-12-02 16:23:37

使用 Boost.MultiIndex,您不需要多达 12 个索引(顺便说一句,4 个元素的排列数量是 4!= 24,而不是 12)来覆盖包含 4 个键的特定子集的所有查询:感谢使用复合键,只要稍加巧妙,6 个索引就足够了。

巧合的是,几年前我在博客中提供了一个示例,展示了如何以几乎完全符合您的特定场景的方式执行此操作:

使用 Boost.MultiIndex 进行多属性查询

提供了源代码,您只需稍加修改即可满足您的需求。同一博客中的一系列文章也提供了该构造的理论依据:

这背后的数学原理并不简单,您可能想安全地忽略它:不过,如果您需要帮助理解它,请毫不犹豫地对博客文章发表评论。

这个容器使用了多少内存?在典型的 32 位计算机中,对象的大小为 4*sizeof(int)+sizeof(double)+sizeof(short)+padding,通常会产生 32 个字节(在 Win32 上使用 Visual Studio 检查)。 Boost.MultiIndex 为每个索引添加了 3 个字(12 字节)的开销,因此对于容器的每个元素,您将获得

32+6*12 = 104 字节 + 填充。

我再次在 Win32 上使用 Visual Studio 检查,获得的大小是每个元素 128 字节。如果你有 10 亿 (10^9) 个元素,那么 32 位是不够的:使用 64 位操作系统很可能会使对象的大小增加一倍,因此所需的内存将达到 256 GB,这是相当强大的野兽(不知道你是否使用像这样巨大的东西。)

With Boost.MultiIndex, you don't need as many as 12 indices (BTW, the number of permutations of 4 elements is 4!=24, not 12) to cover all queries comprising a particular subset of 4 keys: thanks to the use of composite keys, and with a little ingenuity, 6 indices suffice.

By some happy coincindence, I provided in my blog some years ago an example showing how to do this in a manner that almost exactly matches your particular scenario:

Multiattribute querying with Boost.MultiIndex

Source code is provided that you can hopefully use with little modification to suit your needs. The theoretical justification of the construct is also provided in a series of articles in the same blog:

The maths behind this is not trivial and you might want to safely ignore it: if you need assistance understanding it, though, do not hesitate to comment on the blog articles.

How much memory does this container use? In a typical 32-bit computer, the size of your objects is 4*sizeof(int)+sizeof(double)+sizeof(short)+padding, which typically yields 32 bytes (checked with Visual Studio on Win32). To this Boost.MultiIndex adds an overhead of 3 words (12 bytes) per index, so for each element of the container you've got

32+6*12 = 104 bytes + padding.

Again, I checked with Visual Studio on Win32 and the size obtained was 128 bytes per element. If you have 1 billion (10^9) elements, then 32 bits is not enough: going to a 64-bit OS will most likely double the size of obejcts, so the memory needed would amount to 256 GB, which is quite a powerful beast (don't know whether you are using something as huge as this.)

胡渣熟男 2024-12-02 16:23:37

B-Tree 索引位图索引 是使用的两个主要索引,但它们不是唯一的索引。你应该探索它们。 帮助您入门的内容

评估何时使用 B-Tree 以及何时使用 Bitmap 的文章

B-Tree index and Bitmap Index are two of the major indexes used, but they aren't the only ones. You should explore them. Something to get you started .

Article evaluating when to use B-Tree and when to use Bitmap

渡你暖光 2024-12-02 16:23:37

老实说,这取决于访问它的算法。如果这个结构需要常驻,并且你可以承受内存消耗,那么就这样做。 multi_index 很好,但如果它位于标头中,它会破坏您的编译时间。

如果你只需要一次遍历,那么构建结构将是一种浪费。像 next_permutation 这样的东西可能是一个很好的起点。

It depends on the algorithm accessing it, honestly. If this structure needs to be resident, and you can afford the memory consumption, then just do it. multi_index is fine, though it will destroy your compile times if it's in a header.

If you just need a one time traversal, then building the structure will be kind of a waste. Something like next_permutation may be a good place to start.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文