数据仓库多维数据集中应用了哪些数据结构和算法?
据我了解,多维数据集是用于聚合和“切片”大量数据的优化数据结构。我只是不知道它们是如何实施的。
我可以想象很多此类技术都是专有的,但是我可以使用任何资源来开始实施我自己的多维数据集技术吗?
可能涉及集合论和大量数学(欢迎提出建议!),但我主要对实现感兴趣:数据结构和查询算法。
谢谢!
I understand that cubes are optimized data structures for aggregating and "slicing" large amounts of data. I just don't know how they are implemented.
I can imagine a lot of this technology is proprietary, but are there any resources that I could use to start implementing my own cube technology?
Set theory and lots of math are probably involved (and welcome as suggestions!), but I'm primarily interested in implementations: the data structures and query algorithms.
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
有一本很棒的书,描述了 SSAS 实现的许多内部细节,包括存储和查询机制细节:
http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016
There is a fantastic book that describes many internal details of SSAS implementation, including storage and query mechanism details:
http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016
在星型模式数据库中,事实通常以最精细的粒度获取和存储。
因此,让我们采用 http:// 中图 10 中的 SalesFact 示例www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx
现在,粒度是产品、时间(以天为粒度)、存储。
假设您希望按月汇总、预聚合(这个特定示例不太可能需要预聚合,但如果销售按客户按分钟详细列出,则可能需要预聚合)。
然后你将有一个 SalesFactMonthly (或者向现有事实表添加粒度区分,因为维度是相同的 - 有时在聚合中,你实际上可能会丢失维度,就像你可能丢失粒度一样,例如,如果你只想要商店而不是按产品)。
你可以通过这样做得到这个:
在多维数据集中发生的情况是你基本上将细粒度星形和预聚合放在一起 - 但每个实现都是专有的 - 有时你甚至可能在多维数据集中没有最细粒度的数据,所以它可以不被报道。但是您可能想要分割数据的每种方式都需要以这种粒度存储,否则您无法以这种方式进行分析。
In a star-schema database, facts are usually acquired and stored at the finest grain.
So let's take the SalesFact example from Figure 10 in http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx
Right now, the grain is Product, Time (at a day granularity), Store.
Let's say you want that rolled up by month, pre-aggregated (this particular example is very unlikely to need pre-aggregation, but if the sales were detailed by customer, by minute, pre-aggregation might be necessary).
Then you would have a SalesFactMonthly (or add a grain discrimination to the existing fact table since the dimensions are the same - sometimes in aggregation, you may actually lose dimensions just like you can lose grain, for instance if you only wanted by store and not by product).
And you would get this by doing:
What happens in cubes is you basically have fine-grain stars and pre-aggregates together - but every implementation is proprietary - sometimes you might not even have the finest-grain data in the cube, so it can't be reported on. But every way you might want to slice the data needs to be stored at that grain, otherwise you can't produce analysis that way.
通常,数据仓库使用关系数据库,但表不像操作关系数据库那样标准化。
数据仓库是面向主题的。数据仓库主题表通常具有以下特征:
许多索引。
没有连接,除了查找表。
重复数据,主题表为
高度非规范化。
包含派生和聚合信息。
数据仓库中的数据库表以星型模式排列。星型模式基本上是一个带有一组查找表的主题表。查找表的键是主题表中的外键。如果绘制主题表的实体关系图,查找表将像星点一样围绕主题表。
就查询而言,这取决于主题表和行数。通常,查询会花费很长时间(很多分钟,有时几个小时)。
这是一篇帮助您入门的一般文章:开发数据仓库架构
以下是星型模式设计的高级概述: 设计星型模式数据库
Generally, a data warehouse uses a relational database, but the tables aren't normalized like an operational relational database.
A data warehouse is subject oriented. Data warehouse subject tables usually have the following characteristics:
Many indexes.
No joins, except to look up tables.
Duplicated data, the subject table is
highly denormalized.
Contains derived and aggregated information.
The database tables in a data warehouse are arranged in a star schema. A star schema is basically one subject table with an array of look up tables. The keys of the look up tables are foreign keys in the subject table. If you draw an entity relationship diagram of the subject table, the look up tables would surround the subject table like star points.
As far as the queries, that depends on the subject tables and the number of rows. Generally, expect queries to take a long time (many minutes, sometimes hours).
Here's a general article to get you started: Developing a Data Warehouse Architecture
Here's a high level overview of the design of a star schema: Designing the Star Schema Database