有人知道 OLAP 内部原理吗?

发布于 2024-07-17 09:48:01 字数 465 浏览 12 评论 0原文

我对数据库内部结构有所了解。 我之前实际上已经实现了一个小型、简单的关系数据库引擎,使用磁盘上的 ISAM 结构和 BTree 索引以及诸如此类的东西。 这很有趣,而且很有教育意义。 我知道,现在我对 RDBMS 的底层工作原理有了更多的了解,我对仔细设计数据库模式和编写查询有了更多的认识。

但我对多维 OLAP 数据模型一无所知,而且我很难在互联网上找到任何有用的信息。

信息如何存储在磁盘上? 立方体由哪些数据结构组成? 如果 MOLAP 模型不使用带有列和记录的表,那么......怎么办? 尤其是在高维数据中,什么样的数据结构使得MOLAP模型如此高效? MOLAP 实现是否使用类似于 RDBMS 索引的东西?

为什么 OLAP 服务器在处理即席查询方面表现得如此出色? 在普通关系数据库中可能需要数小时才能处理的同类聚合,在 OLTP 多维数据集中只需几毫秒即可处理。 使这成为可能的模型的基本机制是什么?

I know a bit about database internals. I've actually implemented a small, simple relational database engine before, using ISAM structures on disk and BTree indexes and all that sort of thing. It was fun, and very educational. I know that I'm much more cognizant about carefully designing database schemas and writing queries now that I know a little bit more about how RDBMSs work under the hood.

But I don't know anything about multidimensional OLAP data models, and I've had a hard time finding any useful information on the internet.

How is the information stored on disk? What data structures comprise the cube? If a MOLAP model doesn't use tables, with columns and records, then... what? Especially in highly dimensional data, what kinds of data structures make the MOLAP model so efficient? Do MOLAP implementations use something analogous to RDBMS indexes?

Why are OLAP servers so much better at processing ad hoc queries? The same sorts of aggregations that might take hours to process in an ordinary relational database can be processed in milliseconds in an OLTP cube. What are the underlying mechanics of the model that make that possible?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

少女情怀诗 2024-07-24 09:48:02

我已经实现了几个模仿 OLAP 多维数据集功能的系统,以下是我们为使它们正常工作而所做的一些事情。

  1. 核心数据保存在一个 n 维数组中,全部位于内存中,所有键都是通过指向底层数组的指针层次结构来实现的。 通过这种方式,我们可以为相同的数据拥有多个不同的键集。 数组中的数据相当于事实表,通常它只有几条数据,在一个例子中是价格和销售数量。

  2. 底层数组通常是稀疏的,因此一旦创建它,​​我们就会删除所有空白单元以节省内存 - 大量的硬核指针算术,但它有效。

  3. 由于我们有键的层次结构,因此我们可以很容易地编写例程来轻松地向下/向上钻取层次结构。 例如,我们可以通过月份键来访问年份数据,而月份键又映射到天和/或周。 在每个级别,我们都会聚合数据作为构建多维数据集的一部分 - 使计算速度更快。

  4. 我们没有实现任何类型的查询语言,但我们确实支持在所有轴上向下钻取(在我们最大的立方体中最多 7 个),并且这与用户喜欢的 UI 直接相关。

  5. 我们在 C++ 中实现了核心内容,但现在我认为 C# 可能足够快,但我担心如何实现稀疏数组。

希望有帮助,听起来很有趣。

I've implemented a couple of systems that mimicked what OLAP cubes do, and here are a couple of things we did to get them to work.

  1. The core data was held in an n-dimensional array, all in memory, and all the keys were implemented via hierarchies of pointers to the underlying array. In this way we could have multiple different sets of keys for the same data. The data in the array was the equivalent of the fact table, often it would only have a couple of pieces of data, in one instance this was price and number sold.

  2. The underlying array was often sparse, so once it was created we used to remove all the blank cells to save memory - lots of hardcore pointer arithmetic but it worked.

  3. As we had hierarchies of keys, we could write routines quite easily to drill down/up a hierarchy easily. For instance we would access year of data, by going through the month keys, which in turn mapped to days and/or weeks. At each level we would aggregate data as part of building the cube - made calculations much faster.

  4. We didn't implement any kind of query language, but we did support drill down on all axis (up to 7 in our biggest cubes), and that was tied directly to the UI which the users liked.

  5. We implemented core stuff in C++, but these days I reckon C# could be fast enough, but I'd worry about how to implement sparse arrays.

Hope that helps, sound interesting.

蓝戈者 2024-07-24 09:48:02

Microsoft SQL Server 2008 Analysis Services Unleashed一书详细阐述了 SSAS 的一些特殊性2008年,细节不错。 这并不完全是“SSAS 在幕后的工作原理”,但它很有启发性,尤其是在数据结构方面。 (关于确切的算法,它并不是那么详细/具体。)作为该领域的业余爱好者,我从这本书中收集了一些东西。 这就是 SSAS MOLAP 的全部内容:

  • 尽管大家都在谈论多维立方体,但事实表(又名度量组)数据在第一个近似值中仍然最终存储在基本的 2D 表中,每个事实一行。 许多 OLAP 操作似乎最终都由迭代 2D 表中的行组成。
  • 然而,MOLAP 内部的数据可能比相应 SQL 表内的数据小得多。 一个技巧是每个唯一的字符串仅在“字符串存储”中存储一次。 然后,数据结构可以以更紧凑的形式引用字符串(基本上通过字符串 ID)。 SSAS 还以某种形式压缩 MOLAP 存储中的行。 我认为这种缩小可以让更多数据同时保留在 RAM 中,这很好。
  • 同样,SSAS 通常可以迭代数据的子集而不是完整的数据集。 有一些机制在发挥作用:
    • 默认情况下,SSAS 为每个维度/属性值构建哈希索引; 因此,它“立即”知道磁盘上的哪些页面包含相关数据,例如 Year=1997。
    • 有一个缓存架构,其中相关数据子集存储在与整个数据集分开的 RAM 中。 例如,您可能缓存了一个仅包含几个字段的子多维数据集,并且仅与 1997 年的数据有关。如果查询仅询问 1997 年的情况,那么它将仅迭代该子多维数据集,从而加快速度。 (但请注意,“子立方体”初步近似只是一个二维表格。)
    • 如果您是预定义聚合,那么这些较小的子集也可以在多维数据集处理时预先计算,而不仅仅是按需计算/缓存。
  • SSAS 事实表行的大小是固定的,这可能在某种程度上有所帮助。 (相比之下,在 SQL 中,您可能拥有可变宽度的字符串列。)
  • 缓存架构还意味着,一旦计算出聚合,就不需要从磁盘重新获取并一次又一次地重新计算。

无论如何,这些都是 SSAS 中发挥作用的一些因素。 我不能说没有其他重要的事情。

The book Microsoft SQL Server 2008 Analysis Services Unleashed spells out some of the particularities of SSAS 2008 in decent detail. It's not quite a "here's exactly how SSAS works under the hood", but it's pretty suggestive, especially on the data structure side. (It's not quite as detailed/specific about the exact algorithms.) A few of the things I, as an amateur in this area, gathered from this book. This is all about SSAS MOLAP:

  • Despite all the talk about multi-dimensional cubes, fact table (aka measure group) data is still, to a first approximation, ultimately stored in basically 2D tables, one row per fact. A number of OLAP operations seem to ultimately consist of iterating over rows in 2D tables.
  • The data is potentially much smaller inside MOLAP than inside a corresponding SQL table, however. One trick is that each unique string is stored only once, in a "string store". Data structures can then refer to strings in a more compact form (by string ID, basically). SSAS also compresses rows within the MOLAP store in some form. This shrinking I assume lets more of the data stay in RAM simultaneously, which is good.
  • Similarly, SSAS can often iterate over a subset of the data rather than the full dataset. A few mechanisms are in play:
    • By default, SSAS builds a hash index for each dimension/attribute value; it thus knows "right away" which pages on disk contain the relevant data for, say, Year=1997.
    • There's a caching architecture where relevant subsets of the data are stored in RAM separate from the whole dataset. For example, you might have cached a subcube that has only a few of your fields, and that only pertains to the data from 1997. If a query is asking only about 1997, then it will iterate only over that subcube, thereby speeding things up. (But note that a "subcube" is, to a first approximation, just a 2D table.)
    • If you're predefined aggregates, then these smaller subsets can also be precomputed at cube processing time, rather than merely computed/cached on demand.
  • SSAS fact table rows are fixed size, which presumibly helps in some form. (In SQL, in constrast, you might have variable-width string columns.)
  • The caching architecture also means that, once an aggregation has been computed, it doesn't need to be refetched from disk and recomputed again and again.

These are some of the factors in play in SSAS anyway. I can't claim that there aren't other vital things as well.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文