日期/时间维度

发布于 2024-10-16 12:30:23 字数 323 浏览 9 评论 0原文

我正在设计一个数据仓库,但我遇到了一个棘手的时间问题。我需要的粒度是每小时的(计算每小时的事件总数),而且我还必须适应一种不方便适合 24 小时周期的轮班模式(事实上,“蓝色”轮班可能不会涵盖相同的情况)一天中的某个时间,持续几天)。

考虑到这一点,我正在考虑 3 种方法之一,

  1. 其中包含 175K 行的单一时间维度。
  2. 雪花时间维度,日历维度中有 7300 行,时间维度有 175k 行,
  3. 独立的维度使得事实表具有事件日期和事件时间的外键。

我倾向于方法 3,因为它允许在连接中单独引用小日历维度,但我将不胜感激任何想法。

I am designing a data warehouse and I have a sticky issue with time. The grain I need is hourly (to calculate aggregate counts of events per hour) and I also have to accommodate a shift pattern that does not conveniently fit inside a 24 hour period (in fact it is possible that 'blue' shift wont cover the same time of day for several days).

With this in mind I am contemplating one of 3 approaches

  1. a single time dimension with 175K rows in it.
  2. a snowflake time dimension with 7300 rows in a calendar dimension and 175k rows in a time dimension
  3. separate dimensions so that the fact table has foreign keys for the event date and for the event time.

I am tending towards approach 3 as it allows the small calendar dimension to be referenced separately in joins, but I would appreciate any thoughts.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

巴黎盛开的樱花 2024-10-23 12:30:23

是的,制造班次很棘手,并且会随着时间的推移而发生变化,通常一个班次在前一天开始,等等。

请记住,这里有两个日历。一种是标准日历,另一种是生产日历——班次属于生产日历。一般来说,生产日历中的一天可能持续多于(或少于)24 小时。

例如:

2011-02-07 23:45 星期一生成的部分可能看起来像

TimeOfProduction = '2011-02-07 23:45'
DateKey = 20110207
TimeKey = 2345
ProductionDateKey = 20110208 (the first shift of the next day started at 22:00)
ProductionTimeKey = 145 (1 hour and 45 minutes of the current production date)     
ShiftKey = 1
ShiftTimeKey = 145 (1 hour and 45 minutes of the current shift)

这样 所以,我的建议是:

  1. 日期维度(每个日期一行)
  2. 时间维度 (24 小时每分钟一行 = 1440 行 + 请参阅下面的注释)
  3. Shift Dimension -- 使用 rw_ValidFrom、(rw_ValidTo)、rw_IsCurrent 输入 2
  4. 维度DateKey 转换为 ProductionDateKey
  5. TimeKey 角色扮演为 ProductionTimeKeyShiftTimeKey
  6. TimeOfProduction (datetime) 也保留在事实表中。
  7. 在 ETL 过程中,应用当前的班次逻辑将 ProductionDateKey、ProductionTimeKey、ShiftKey、ShiftTimeKey 附加到 factPart 表的每一行。

请注意,如果生产日持续超过 24 小时,您可能需要向时间维度添加额外的行。如果使用当地时间并且存在夏令时跳跃,则通常可以。

因此,星星可能看起来像这样

在此处输入图像描述

Yes, manufacturing shifts are tricky and do change over time, often one shift starts day before, etc.

Keep in mind that there are two calendars here. One is the standard calendar and the other is the production calendar -- the shift belongs to the production calendar. In general, a day in production calendar may last more (or less) than 24 hours.

For example:

Part produced on Monday, 2011-02-07 23:45 may look like

TimeOfProduction = '2011-02-07 23:45'
DateKey = 20110207
TimeKey = 2345
ProductionDateKey = 20110208 (the first shift of the next day started at 22:00)
ProductionTimeKey = 145 (1 hour and 45 minutes of the current production date)     
ShiftKey = 1
ShiftTimeKey = 145 (1 hour and 45 minutes of the current shift)

So, my suggestion is:

  1. Plain Date Dimension (one row per date)
  2. Plain Time Dimension (one row per minute for 24 hours = 1440 rows + see note below)
  3. Shift Dimension -- type 2 dimension with rw_ValidFrom, (rw_ValidTo) , rw_IsCurrent
  4. Role-play the DateKey into ProductionDateKey
  5. Role-play the TimeKey into a ProductionTimeKey and ShiftTimeKey.
  6. Keep the TimeOfProduction (datetime) in the fact table too.
  7. During the ETL process, apply the current shift logic to attach ProductionDateKey, ProductionTimeKey, ShiftKey, ShiftTimeKey to each row of the factPart table.

Note that you may need to add extra rows to the Time Dimension if a production day can last more than 24 hours. It usually can if a local time is used and there is a daylight savings time jump.

So, the star may look something like this

enter image description here

凉风有信 2024-10-23 12:30:23

我的 0.02 英镑的价值:

假设没有因考虑转变而产生额外问题(@Andriy M 的问题):

我倾向于打折选项 2,除非有特定的好处(性能、一类的简化)查询等)你可以通过采用它来看到。您没有描述任何此类好处,因此您似乎是为了本身而增加了复杂性。

我个人更喜欢选项 1——概念上最简单、最直接并且 (IMO) 最适合数据仓库方法。

选项 3 具有您提到的优点,但我一直怀疑它涵盖了两种选择:日历维度正如您所描述的那样,但时间维度的选择是 175k 行或 24。我目前无法提供对于这两种选择中的任何一种的争论,只是一种直觉,认为有两种这样的选择。如果转变问题与此相关,它可能会影响这些替代方案之间的选择(如果它们是真正的替代方案)。

如果您希望进一步采用选项 2,选项 3 中列出的替代方案也适用。

My £0.02 for what it is worth:

Assuming that there is no additional issue arising from consideration of the shift (@Andriy M's question):

I would tend to discount option 2 unless there is a specific benefit (performance, simplification of a class of query, etc.) you can see from adopting it. You do not describe any such benefit, so it seems that you are adding complexity for its own sake.

My personal preference would be for option 1 - conceptually the simplest, the most direct, and the (IMO) best fit to data warehouse approaches.

Option 3 has the advantages you mention, but I have the nagging suspicion that it covers two alternatives: in both the calendar dimension is as you describe it, but the choices for the time dimension are 175k rows, or 24. I cannot at present provide arguments for either of these alternatives, only a gut feeling that there are two such choices. If the shift issue IS relevant here, it might influence the choice between these alternatives (if they are genuine alternatives).

If you wish to take option 2 further, the alternatives set out for option 3 are also relevant.

阿楠 2024-10-23 12:30:23

我会选择选项 3。 - 单独的尺寸。优点:

  • 简单性 - 两个相对较小的表 - 时间维度仅加载一次,因为一天中有固定的分钟数。

  • 重用 - 两个独立的维度更有可能与仅具有日期或时间维度的其他事实表共享

  • 通过在事实表中为日期维度提供单独的属性来轻松分区

  • 可扩展性 - 考虑可以添加到日期和时间的属性时间维度随着您的报告需求的增长而变化。对于日期维度,这可能是(以避免每次从日期中提取此信息):年、季度、月、日、周、日期标签(如“2011 年 9 月 12 日”)、月份名称、工作日名称、各种指标(假日指标、季末、月末等)。对于时间维度(为了准确性,可以包含一天中的每一秒),这可以是:小时、分钟、秒、日期部分标签(如“早上”、“晚上”)、工作时间指示器(从 8 开始的秒数: 00:00 到 17:00:00)等。但是,将所有内容都集中在一个维度中将意味着大量冗余。

在我看来,与日开始/结束不一致的班次是记录每个班次开始和结束时间戳的单独事实寓言的良好候选者 - 我的意思是具有以下外键的(无事实)事实表:id_date_start,id_time_start,id_date_end, id_time_end。然后,您可以从事件事实表“钻取”到班次表,以获得每个班次的汇总结果。

编辑:或者模型转变就像另一个维度一样 - 这取决于这样一个事实:对于您而言,转变是否是一个重要的业务流程,您希望通过其属性独立跟踪该流程(但目前我不认为任何其他属性,然后是日期和时间...位置,也许?)或者它只是事件的上下文(因此应该只是一个维度)。

I would choose option 3. - Separate dimensions. Benefits:

  • Simplicity - two relatively small tables - with Time dimension loaded only once as there's fixed number of minutes in a day.

  • Reuse - two separete dimensions are more likely to be shared with other fact tables that can have only Date or Time dimension

  • Easy partitioning by having separate attribute for Date dimension in a fact table

  • Extensibility - think of attributes you could add to Date and Time dimensions as your reporting needs grow. For a Date dimension this could be (to avoid extracting this information each time from date): year, quarter, month, day, week, date label (like "12th September 2011"), month name, weekday name, various indicators (holiday indicator, end of quarter, end of month, etc.). For a Time dimension (which could - for accuracy - contain each second of a day) this could be: hour, minute, second, day part label (like "morning", "evening"), working time indicator (seconds from 8:00:00 to 17:00:00), etc. But having it all in just one dimension would mean a lot of redundancy.

Shifts that are not aligned with day start / end look to me as a good candidate for a separate fact fable recording start and end timestamp for each shift - I mean (factless) fact table with the following foreign keys: id_date_start, id_time_start, id_date_end, id_time_end. Then you can "drill-across" from the events fact table to the shifts table to get aggregate results for each shift.

Edit: Or model shifts just as another dimension - that depends on the fact if for you shift is an important business process that you want to track independently with its attributes (but at the moment I can't think of any other attributes then Date & Time... Location, perhaps?) or if it's just a context of an event (and therefore should be just a dimension).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文