数据仓库中每个事实的开始和结束时间段
我被要求向我们的数据仓库添加一个新表。目前,我们将事实分为月度、季度和年度表格,每个表格都有时间维度。每条事实记录都有一个时间值。数据在源系统中按开始和结束时间段生成,结束日期成为事实记录的时间维度值。事实进入月份、季度或年份事实表的流程告诉人们如何理解记录中的日期以及如何使用它们。
我被要求让新表包含每条记录的开始日期和结束日期。有人告诉我,这违反了数据仓库原则,但它更好地代表了数据的生成方式,并允许更灵活地查询数据,例如滚动周期等。
我不是数据仓库专家。我知道每个事实都有一个时间维度是一个原则。我的问题是,违反这一原则会产生什么后果?换句话说,反对这样做的理由是什么?我将来这样做可能会遇到什么问题?在我看来,每个事实的开始和结束时间确实能更好地代表数据,但我承认我的了解不足以完全评估这种设计选择的含义。任何人都可以提供一些前景吗?
编辑: 我很欣赏这些答案。他们至少告诉我,这种做法并不像我想象的那么糟糕。我将澄清有关日期的一件事:它们并不代表有效期,而是代表一个汇总时期。因此,事实记录可以代表任意月份期间计算的某种成分的平均使用磅数。不知道这是否有任何区别,但确实如此。
I have been asked to add a new table to our data warehouse. Currently, we separate our facts into monthly, quarterly, and yearly tables, with time dimensions for each. Each fact record has one time value. The data is generated in the source system by start and end period, and the end date becomes the time dimension value of the fact record. The flow of the fact into either the month, quarter, or year fact table tells one how to understand the dates in the records and how to use them.
I've been asked to have the new table contain start and end dates in each record. I have been told that this violates a data warehousing principle, but it better represents the way the data is generated and allows for more flexible querying of the data, e.g. for rolling periods, etc.
I'm no data warehouse expert. I understand that a single time dimension per fact is a principle. My question is, what are the ramifications of breaking that principle? In other words, what are the arguments against doing so? What problems might I face in the future in doing so? It seems to me that having the start and end periods for each fact does better represent the data, but I admit that I don't know enough to fully evaluate the implications of this design choice. Can anyone perhaps provide some prespective?
Edit:
I appreciate these answers. They at least tell me it isn't as bad a practice as I was led to believe. I will clarify one thing about the dates: They don't represent a period of validity, but rather a period of aggregation. Thus, a fact record may represent an average of pounds used of a certain ingredient as calculated for an arbitrary period of months. Don't know if this makes any difference, but there it is.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
也许是时候买一本好的数据仓库书籍了,我推荐 Kimball Group 的一些书,Ralph Kimball 几乎是快速入门数据仓库的首选。如果有帮助的话,我可以进一步详细说明,但我将从两点开始,这可能会帮助您扭转局面并取得进展。
每个事实有多个时间维度是很常见的。有人在告诉您违反公认的正常做法时向您提供了错误的信息。作为“订单”事实的示例,您通常会有订单日期、发货日期、交货日期、期间等。
如果您使用开始日期和结束日期,则通常表明您正在使用已知的内容作为类型 2 维度,或缓慢变化的维度。情况可能并非如此,但在做出决定之前,请确保您了解缓慢变化的维度。
It might be time to grab a good datawarehousing book, I recommend something from the Kimball Group, Ralph Kimball is pretty much the goto for quickly getting started in datawarehousing. I can elaborate further if it is helpful but I'll start with two points that may help get you turned around and making progress.
It is VERY common to have multiple time dimensions per fact. Someone gave you incorrect information when telling you that violated accepted normal practices. As an example for an "order" fact you would commonly have an order date, a ship date, a delivery date, Period, etc.
If you are using begin and end dates it often indicates you are working with what is known as a type 2 dimension, or a slowly changing dimension. That may not be the case but make sure you understand slowly changing dimensions before you make your decision.
记录开始日期和结束日期的优点是可以更轻松地表示不一致的时间段。这意味着您可以更轻松地连接、聚合和比较以不同粒度记录的数据。从您的描述来看,您的提议似乎没有任何根本上的“错误”。我以前也实现过类似的事情。
我发现表中时间段的最佳模型是使用半开区间。即:间隔为StartDate>=x<表示的周期。结束日期。半开区间使连接和比较变得更容易。
Recording both start and end dates has the advantage that you can represent non-uniform periods of time more easily. It means you can more easily join, aggregate and compare data recorded with different granularity. From your description there doesn't seem to be anything fundamentally "wrong" with what you are proposing. I've implemented similar things before.
I find that the best model for time periods in a table is to use half open intervals. I.e.: the interval is the period represented by StartDate >= x < EndDate. Half open intervals make joins and comparisons easier.
每个事实表都有一个粒度。事实表的粒度指定表的每一行代表什么——一笔交易或某种聚合(每日、每周、每月......)。
我想您当前的表是聚合表,并且(在这些情况下很常见)聚合表中的每个记录都有一个指向日期维度的外键,该外键指向期间的末尾。因此,例如,每周聚合表中的每条记录每周都有一行,并指向一周的最后一天(星期六或星期日)。请注意,在此期间开始时使用另一个密钥将是多余的。
现在,如果您希望允许期间报告的灵活性,那么您应该考虑一笔交易的表粒度,换句话说,表中的一行应该是一笔交易和任何日期/时间FK指向实际交易的时间。
错误的方法是将谷物混合在同一张桌子上。考虑以下
任何包含两行的
sum()
,都会对已包含在第一个条目中的第二个条目进行双重计数。总而言之,如果您的月度、季度和年度聚合不够精细,只需引入一个具有更细粒度的事实表——一天聚合或单笔交易。
Each fact table has a grain. The grain of a fact table specifies what each row of the table represents -- one transaction or some kind of an aggregate (daily, weekly, monthly..).
I suppose your current tables are aggregates and -- as common in these cases -- each record in the aggregate table has a foreign key to the date dimension that points to the end of the period. So, for example, each record in a weekly aggregate table has one row per week and points to the last day of the week (Saturday or Sunday). Note that having another key for the start of this period would simply be redundant.
Now, if you are looking to allow for flexibility of period-reporting, then you should consider the table grain of one transaction, in other words one row in a table should be one transaction and any date/time FK points to the time of the actual transaction.
The wrong approach would be to mix grains in the same table. Consider the following
Any
sum()
which would include both rows, would double-count the second entry which is already included in the first one.To summarize, if your monthly, quarterly and yearly aggregations are not fine enough, simply introduce a fact table with a finer grain -- one day aggregate or a single transaction.
好吧。这就是我处理(将是)相同要求的方式。我使用记录事件日期的新日期字段模拟对事实表的调整。
例如,从上面
EventDateKey Amount RecordType
20110327 700.0 Source
20110329 -500.0 DW 调整
因此,如果您需要聚合(对金额求和),您的数据可以使用 EventDateKey 并通过同一日期维度处理任何期间。这很复杂,因为您正在模拟事实表上的调整,但它提供了您所寻求的所有灵活性,而不会丢失信息的粒度。
Alright. This is the way I handle (will be) the same requirements. I simulate adjustments into my fact table with a new date field that records the event date.
For example, from above
EventDateKey Amount RecordType
20110327 700.0 Source
20110329 -500.0 DW Adjustment
So if you need to aggregate (Sum the amount) your data can use the EventDateKey and work with any period through the same Date dimension. This is complicate because you are simulating a adjustment on your fact table but it gives all the flexiblity you looking for without loosing the grain of the information.