数据仓库中的时间和日期维度
我正在构建一个数据仓库。每个事实都有它的时间戳
。我需要按天、月、季度创建报告,但也需要按小时创建报告。查看示例,我发现日期往往保存在维度表中。
(来源:etl-tools.info)
但我认为,这对于时间来说没有意义。维度表会不断增长。另一方面,使用日期维度表进行 JOIN 比在 SQL
中使用日期/时间函数更有效。
您的意见/解决方案是什么?
(我正在使用Infobright)
I'm building a data warehouse. Each fact has it's timestamp
. I need to create reports by day, month, quarter but by hours too. Looking at the examples I see that dates tend to be saved in dimension tables.
(source: etl-tools.info)
But I think, that it makes no sense for time. The dimension table would grow and grow. On the other hand JOIN with date dimension table is more efficient than using date/time functions in SQL
.
What are your opinions/solutions ?
(I'm using Infobright)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
Kimball 建议 单独的时间和日期维度:
Kimball recommends having separate time- and date dimensions:
我的猜测是,这取决于您的报告要求。
如果您需要
每天 10:00:00 到 10:59:59 之间的含义,那么我会使用时间维度,因为它比
为每一行评估 date_part() 函数更快。
您仍然应该将时间戳保留在事实表中,以便在天数范围内进行聚合,例如:
在使用维度字段时这会变得很尴尬。
通常,时间维度具有分钟分辨率,因此有1440行。
My guess is that it depends on your reporting requirement.
If you need need something like
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
which gets awkward when using dimension fields.
Usually, time dimension has a minute resolution, so 1440 rows.
时间应该是数据仓库的一个维度,因为您经常需要对其进行聚合。您可以使用 snowflake-Schema 来减少开销。总的来说,正如我在评论中指出的那样,时间似乎是一个异常高的分辨率。如果你坚持这样做,将一天中的某个时刻作为一个单独的维度可能会有所帮助,但我无法告诉你这是否是一个好的设计。
Time should be a dimension on data warehouses, since you will frequently want to aggregate about it. You could use the snowflake-Schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you if this is good design.
我建议为日期和时间设置单独的维度。日期维度将为每个日期提供 1 条记录,作为已识别的有效日期范围的一部分。例如:1980 年 1 月 1 日至 2025 年 12 月 31 日。
时间的单独维度有 86400 条记录,每秒有一条由时间键标识的记录。
在事实记录中,如果您需要日期和时间,请添加引用这些一致维度的两个键。
I would recommend having seperate dimension for date and time. Date Dimension would have 1 record for each date as part of identified valid range of dates. For example: 01/01/1980 to 12/31/2025.
And a seperate dimension for time having 86400 records with each second having a record identified by the time key.
In the fact records, where u need date and time both, add both keys having references to these conformed dimensions.