What data structures and algorithms are used in data warehouse cubes?

Posted 2024-09-02 05:19:48

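From what I understand, cubes are optimized data structures for aggregating and "slicing" large amounts of data. I just don't know how they are implemented.

I can imagine a lot of this technology is proprietary, but are there any resources I could use to start implementing my own cube technology?

Set theory and a lot of math are probably involved (suggestions welcome!), but I'm primarily interested in implementation: the data structures and query algorithms.

Thanks!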

花海 2024-09-09 05:19:48

There is a great book that describes many of the internal details of the SSAS implementation, including the storage and query mechanisms:

http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016

夜巴黎 2024-09-09 05:19:48

In a star-schema database, facts are usually acquired and stored at the finest grain.

So let's take the SalesFact example from Figure 10 in http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx

Right now, the grain is Product, Time (at a day granularity), Store.

Let's say you want that rolled up by month, pre-aggregated (this particular example is very unlikely to need pre-aggregation, but if the sales were detailed by customer, by minute, pre-aggregation might be necessary).

Then you would have a SalesFactMonthly (or add a grain discrimination to the existing fact table since the dimensions are the same - sometimes in aggregation, you may actually lose dimensions just like you can lose grain, for instance if you only wanted by store and not by product).

ProductID
TimeID (only linking to DayOfMonth = 1)
StoreID
SalesDollars

And you would get this by doing:

INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
SELECT sf.ProductID
    ,(SELECT TimeID FROM TimeDimension WHERE Year = td.Year AND Month = td.Month AND DayOfMonth = 1) -- One way to find the single month dimension row
    ,sf.StoreID
    ,SUM(sf.SalesDollars)
FROM SalesFact AS sf
INNER JOIN TimeDimension AS td
    ON td.TimeID = sf.TimeID
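-- Every non-aggregated column in the SELECT list must also appear in GROUP BY
GROUP BY sf.ProductID, sf.StoreID, td.Year, td.Month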
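
To try the roll-up end to end, here is a minimal sketch using Python's built-in sqlite3 module. The table layouts and sample rows are assumptions made up for illustration (the white paper's actual SalesFact schema has more columns), and this variant finds the month row with a self-join on TimeDimension rather than a scalar subquery, grouping by product, store and month.

```python
import sqlite3


def build_monthly_rollup(conn):
    """Roll daily SalesFact rows up to one row per product, store and month."""
    conn.execute("""
        INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
        SELECT sf.ProductID, m.TimeID, sf.StoreID, SUM(sf.SalesDollars)
        FROM SalesFact AS sf
        JOIN TimeDimension AS td
            ON td.TimeID = sf.TimeID
        -- m is the dimension row for the first day of the same month
        JOIN TimeDimension AS m
            ON m.Year = td.Year AND m.Month = td.Month AND m.DayOfMonth = 1
        GROUP BY sf.ProductID, m.TimeID, sf.StoreID
    """)


def demo():
    """Create a tiny, hypothetical star schema and return the monthly rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE TimeDimension (
            TimeID INTEGER PRIMARY KEY, Year INT, Month INT, DayOfMonth INT);
        CREATE TABLE SalesFact (
            ProductID INT, TimeID INT, StoreID INT, SalesDollars REAL);
        CREATE TABLE SalesFactMonthly (
            ProductID INT, TimeID INT, StoreID INT, SalesDollars REAL);
    """)
    conn.executemany(
        "INSERT INTO TimeDimension VALUES (?, ?, ?, ?)",
        [(1, 2024, 1, 1), (2, 2024, 1, 15), (3, 2024, 2, 1), (4, 2024, 2, 20)],
    )
    conn.executemany(
        "INSERT INTO SalesFact VALUES (?, ?, ?, ?)",
        [(10, 1, 100, 5.0), (10, 2, 100, 7.0), (10, 4, 100, 3.0), (11, 2, 100, 2.0)],
    )
    build_monthly_rollup(conn)
    return conn.execute(
        "SELECT ProductID, TimeID, StoreID, SalesDollars "
        "FROM SalesFactMonthly ORDER BY ProductID, TimeID"
    ).fetchall()
```

Calling demo() returns one row per product, store and month, with TimeID pointing at the first day of that month. The self-join silently drops a month if its DayOfMonth = 1 row is missing from TimeDimension, so a real load would want to check for that.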

What happens in cubes is that you basically have the fine-grain star and the pre-aggregates together, though every implementation is proprietary. Sometimes you might not even have the finest-grain data in the cube, in which case it can't be reported on. But every way you might want to slice the data needs to be stored at that grain (or at a finer grain it can be rolled up from); otherwise you can't produce analysis that way.
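
As a rough illustration of "fine-grain data plus pre-aggregates", the sketch below materializes a total for every combination of dimensions, which is the simplest possible cube. It is not how SSAS or any other product stores data; the dimension names, the sample facts and the build_cube/query helpers are all hypothetical. Real engines typically materialize only some of these group-bys, since the number of combinations doubles with each added dimension.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimension names, in a fixed order used for all keys.
DIMENSIONS = ("product", "month", "store")


def build_cube(facts, measure="sales"):
    """Pre-aggregate the measure for every subset of DIMENSIONS.

    Returns a dict mapping a tuple of dimension names to a dict that maps
    a tuple of dimension values to the summed measure.
    """
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(float)
            for row in facts:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


def query(cube, **filters):
    """Look up one pre-aggregated cell, e.g. query(cube, store="S1")."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0.0)


FACTS = [
    {"product": "P1", "month": "2024-01", "store": "S1", "sales": 5.0},
    {"product": "P1", "month": "2024-01", "store": "S2", "sales": 7.0},
    {"product": "P2", "month": "2024-02", "store": "S1", "sales": 3.0},
]

CUBE = build_cube(FACTS)
```

With this layout a slice such as query(CUBE, store="S1") is a single dictionary lookup instead of a scan over the facts, at the cost of storing 2^3 = 8 group-bys for three dimensions. Dropping a dimension from the filters corresponds to the "losing dimensions" case mentioned above.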
