数据仓库-维度建模
我是 BI/数据仓库新手,在构建了一些简单的示例之后,我需要构建一个更复杂的结构。我的项目最初涉及产品许可证,我正在按月/年和程序测量销售量,并仅计算许可证数量。
现在的要求是引入这些指标的跳跃。例如,当您访问某一组许可证时,他们希望看到这些许可证的完全不同的指标。例如,如果 2011 年 3 月售出了 100 个许可证,其中有多少人安装、激活和取消了该产品。 (我们跟踪该信息,但不在 DW 中)。所以,我正在寻找最好的方法来做到这一点......我假设我要做的第一件事是添加已安装、已激活和已取消的三个维度 - 并有三个事实表?或者每个许可证都有一个事实表,并有一行用于取消、安装或激活? (因此一份许可证可以重复)。或者有一个事实表,其中包含已安装、已取消、已激活的不同字段?另外,如何将一个事实表与另一个事实表关联起来?是通过维度,还是它们可以通过其他方式关联?
任何帮助将不胜感激!
编辑:
感谢您的帖子...我还认为第二个选项可能是正确的。但在这个实现中,我有一个独特的问题。因此,衡量的事实之一是销售的许可证数量 - 当然是按日期。假设我添加一行用于安装、取消、激活。要求他们能够看到相互关联的事实。例如,如果我在给定的时间范围内添加单独的行,我可以知道已售出多少行以及安装了多少行。
但他们希望了解在给定的时间范围内,购买了多少,以及安装了多少。例如,如果时间范围是 3 月,并且 3 月份售出了 100 台,那么在这 100 台中,安装了多少台 - 即使他们的安装时间可能比 3 月晚得多,因此行日期将不在他们正在查找的时间范围内在...这是一个常见问题吗?怎么解决的?
I am new to BI/Datawarehousing, and after building some easy samples, I have the need to build a more complex structure. My project initially involved product licenses, and I was measuring how many sold, by month/year and by program, and just counting the number of licenses.
Now the requirement is to introduce jump offs from those metrics. As in, when you come to a certain group of licenses, they want to see a whole different metrics of those. Such as, if 100 licenses were sold in mar 2011, how many of them installed, activated and cancelled the product. (we track that info, but not in the DW). So, I am looking for the best way to do this...I assume the first thing I have to do is add three dimensions for installed, activated and cancelled - and have three fact tables? Or have one fact table with each license, and have a row for cancelled, installed or activated? (so one license may be repeated). Or have one fact table, with different fields for installed, cancelled, activated? Also, how do you relate one fact table to another? Is it through dimensions, or they can related in some other ways?
Any help would be much appreciated!
EDIT:
Thanks for the post... I was also thinking the second option is probably the correct one. But in this implementation, I have a unique problem. So, one of the facts that is measured is the number of licenses that are sold - by date of course. Lets say I add a row for installed, cancelled, activated. The requirement is for them to be able to see a connected fact. For example, if I add individual rows, given a timeframe, I can tell how many were sold, and how many were installed.
But they want to see given a timeframe, how many were bought, and out of them, how many installed. e.g., if the timeframe is march, and 100 were sold in march, out of those 100, how many were installed - even though they could have installed much later than march, and therefore the row date would be not in the timeframe they are looking at....is this a common problem? how is it solved?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
并不真地。许可证出售是事实。它有一个价格。
许可证销售具有日期、产品、客户和计划等维度。
“安装”或“激活”是许可证的状态更改事件。每个许可证都有“事件”(销售、安装、激活等)。
因此许可证具有“销售”事实、“安装”事实和“激活”事实。其中每一个(至少)都与时间有关系。
这提供了最大的灵活性,因为每个事件都可以具有丰富的多个维度。然后可以组织一系列事件来提供许可证的历史记录。
这效果非常好。
您经常需要为简单的计数和求和创建汇总表,以便不必遍历所有事件来获取最常见的仪表板指标。
正确的。您要将事实表中的几行连接在一起。销售事件的行、与安装事件的行的外部连接、与激活事件的行的外部连接等。这只是事实之间的外部连接。
所以。三月份的销售量很容易计算。事件=“销售”。时间是 time.month = "march" 的所有行。简单的。
3 月份转化为安装量的销售量。相同的“三月销售”where 子句外部与这些许可证的所有“安装”事件连接。 “销售额”的计数与 count(*) 相同。安装计数可能会较小,因为外部联接放入了一些空值。
3 月份成为激活的销售数量。 “三月销售”where 子句外部与所有“激活”事件连接。请注意,激活没有日期限制。
这也不太有效,因为表的列规定了业务流程。该业务流程可能会发生变化,并且您将无休止地调整事实表中的列。
话虽如此,它“效果不佳”意味着它没有提供最终的灵活性。在某些情况下,您不需要最终的灵活性。在某些情况下,行业(或法规)可能会定义一个相当固定的结构。
尺寸根据定义。事实表只有两件事——测量值和维度的 FK。
某些维度(例如“许可证实例”)是退化的,因为除了 PK 之外,该维度可能几乎没有可用的属性。
因此,您有一个与许可证相关的“已售出”事实、一个与许可证相关的可选“已安装”事实和一个与许可证相关的可选“激活”事实。许可证是一个对象 ID(数据库代理键),并且可能是许可证标识符本身(可能是许可证序列号或数据库外部的其他内容)。
在执行其他操作之前,请先阅读 Ralph Kimball 的数据仓库工具包。
Not really. A license sale is a fact. It has a price.
A license sale has has dimensions like date, product, customer and program.
An "installation" or "activitation" is a state-change event of a license. You have "events" for each license (sale, install, activate, etc.)
So a license has a "sale" fact, an "installation" fact and an "activation" fact. Each of which is (minimally) a relationship with time.
This gives the most flexibility, because each event can be rich with multiple dimensions. A sequence of events can be then be organized to provide the history of a license.
This works out very well.
You will often want to create summary tables for simple counts and sums to save having to traverse all events for the most common dashboard metrics.
Right. You're joining several rows from the fact table together. A row where the event was sold, outer joined with a row where the event was installed outer joined with row where the event was activated, etc. It's just outer joins among the facts.
So. Count of sales in March is easy. Event = "Sale". Time is all the rows where time.month = "march". Easy.
Count of sales in march which became installs. Same "march sales" where clause outer joined with all "install" events for those licenses. Count of "sales" is the same as count(*). Count of installs may be smaller because the outer join puts in some nulls.
Count of sales in march which became activations. The "march sales" where clause outer joined with all "activation" events. Note that the activation has no date constraint.
This doesn't work out as well because the table's columns dictate a business process. That business process might change and you'll be endlessly tweaking the columns in the fact table.
Having said it doesn't work out "as well" means it doesn't give ultimate flexibility. In some cases, you don't need ultimate flexibility. In some cases, the industry (or regulations) may define a structure that's quite fixed.
Dimensions by definition. A fact table only has two things -- measurements and FK's to dimensions.
Some dimensions (like "license instance") are degenerate because the dimension may have almost no usable attributes other than a PK.
So you have an "sold" fact that ties to a license, a optional "installed" fact that ties to a license and an optional "activate" fact that ties to a license. The license is an object ID (the database surrogate key) and -- perhaps -- the license identifier itself (maybe a license serial number or something outside the database).
Please by Ralph Kimball's Data Warehouse Toolkit before doing anything more.