数据仓库多维数据集中应用了哪些数据结构和算法?

发布于 2024-09-02 05:19:48 字数 173 浏览 9 评论 0原文

据我了解,多维数据集是用于聚合和“切片”大量数据的优化数据结构。我只是不知道它们是如何实施的。

我可以想象很多此类技术都是专有的,但是我可以使用任何资源来开始实施我自己的多维数据集技术吗?

可能涉及集合论和大量数学(欢迎提出建议!),但我主要对实现感兴趣:数据结构和查询算法。

谢谢!

I understand that cubes are optimized data structures for aggregating and "slicing" large amounts of data. I just don't know how they are implemented.

I can imagine a lot of this technology is proprietary, but are there any resources that I could use to start implementing my own cube technology?

Set theory and lots of math are probably involved (and welcome as suggestions!), but I'm primarily interested in implementations: the data structures and query algorithms.

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

花海 2024-09-09 05:19:48

有一本很棒的书,描述了 SSAS 实现的许多内部细节,包括存储和查询机制细节:

http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016

There is a fantastic book that describes many internal details of SSAS implementation, including storage and query mechanism details:

http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016

夜巴黎 2024-09-09 05:19:48

在星型模式数据库中,事实通常以最精细的粒度获取和存储。

因此,让我们采用 http:// 中图 10 中的 SalesFact 示例www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx

现在,粒度是产品、时间(以天为粒度)、存储。

假设您希望按月汇总、预聚合(这个特定示例不太可能需要预聚合,但如果销售按客户按分钟详细列出,则可能需要预聚合)。

然后你将有一个 SalesFactMonthly (或者向现有事实表添加粒度区分,因为维度是相同的 - 有时在聚合中,你实际上可能会丢失维度,就像你可能丢失粒度一样,例如,如果你只想要商店而不是按产品)。

ProductID
TimeID (only linking to DayOfMonth = 1)
StoredID
SalesDollars

你可以通过这样做得到这个:

INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
SELECT sf.ProductID
    ,(SELECT TimeID FROM TimeDimension WHERE Year = td.Year AND Month = td.Month AND DayOfMonth = 1) -- One way to find the single month dimension row
    ,sf.StoreID
    ,SUM(sf.SalesDollars)
FROM SalesFact AS sf
INNER JOIN TimeDimension AS td
    ON td.TimeID = sf.TimeID
GROUP BY td.Year, td.Month

在多维数据集中发生的情况是你基本上将细粒度星形和预聚合放在一起 - 但每个实现都是专有的 - 有时你甚至可能在多维数据集中没有最细粒度的数据,所以它可以不被报道。但是您可能想要分割数据的每种方式都需要以这种粒度存储,否则您无法以这种方式进行分析。

In a star-schema database, facts are usually acquired and stored at the finest grain.

So let's take the SalesFact example from Figure 10 in http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx

Right now, the grain is Product, Time (at a day granularity), Store.

Let's say you want that rolled up by month, pre-aggregated (this particular example is very unlikely to need pre-aggregation, but if the sales were detailed by customer, by minute, pre-aggregation might be necessary).

Then you would have a SalesFactMonthly (or add a grain discrimination to the existing fact table since the dimensions are the same - sometimes in aggregation, you may actually lose dimensions just like you can lose grain, for instance if you only wanted by store and not by product).

ProductID
TimeID (only linking to DayOfMonth = 1)
StoredID
SalesDollars

And you would get this by doing:

INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
SELECT sf.ProductID
    ,(SELECT TimeID FROM TimeDimension WHERE Year = td.Year AND Month = td.Month AND DayOfMonth = 1) -- One way to find the single month dimension row
    ,sf.StoreID
    ,SUM(sf.SalesDollars)
FROM SalesFact AS sf
INNER JOIN TimeDimension AS td
    ON td.TimeID = sf.TimeID
GROUP BY td.Year, td.Month

What happens in cubes is you basically have fine-grain stars and pre-aggregates together - but every implementation is proprietary - sometimes you might not even have the finest-grain data in the cube, so it can't be reported on. But every way you might want to slice the data needs to be stored at that grain, otherwise you can't produce analysis that way.

心舞飞扬 2024-09-09 05:19:48

通常,数据仓库使用关系数据库,但表不像操作关系数据库那样标准化。

数据仓库是面向主题的。数据仓库主题表通常具有以下特征:

  • 许多索引。

  • 没有连接,除了查找表。

  • 重复数据,主题表为
    高度非规范化。

  • 包含派生和聚合信息。

数据仓库中的数据库表以星型模式排列。星型模式基本上是一个带有一组查找表的主题表。查找表的键是主题表中的外键。如果绘制主题表的实体关系图,查找表将像星点一样围绕主题表。

就查询而言,这取决于主题表和行数。通常,查询会花费很长时间(很多分钟,有时几个小时)。

这是一篇帮助您入门的一般文章:开发数据仓库架构

以下是星型模式设计的高级概述: 设计星型模式数据库

Generally, a data warehouse uses a relational database, but the tables aren't normalized like an operational relational database.

A data warehouse is subject oriented. Data warehouse subject tables usually have the following characteristics:

  • Many indexes.

  • No joins, except to look up tables.

  • Duplicated data, the subject table is
    highly denormalized.

  • Contains derived and aggregated information.

The database tables in a data warehouse are arranged in a star schema. A star schema is basically one subject table with an array of look up tables. The keys of the look up tables are foreign keys in the subject table. If you draw an entity relationship diagram of the subject table, the look up tables would surround the subject table like star points.

As far as the queries, that depends on the subject tables and the number of rows. Generally, expect queries to take a long time (many minutes, sometimes hours).

Here's a general article to get you started: Developing a Data Warehouse Architecture

Here's a high level overview of the design of a star schema: Designing the Star Schema Database

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文