Data warehouse performance difference with denormalized time

Posted on 2024-10-20 16:36:01

I have read in books that splitting the time into separate columns (e.g. day, month, year and so on) is a real performance booster.

  1. Do databases already have some smart approach for handling indexes over time columns, so that splitting the time and adding millions of index variants is obsolete?

  2. Any experience with the performance difference?

A possible query would be sales on Mondays between 13:00 and 14:00.

Comments (4)

梦初启 2024-10-27 16:36:02

The specific scenario you outline (13:00-14:00 every Monday) cannot be properly served by normal indexes against datetime data.

It would require a lot of dissecting of the datetime data into the day of week + the time portion to get at that info. For this scenario, breaking it into a column for day of week and another for time of day (hour) will work a lot better and can be indexed separately or as a composite (across both).

Performance is very different: with indexes on day-of-week + time-of-day, the query looks at only 1/168th of the data (theoretical average), or more realistically about 1/50th of the data (working hours). Without them, the query has to run two transformations (to get the day-of-week and time-of-day components) and then run the result through a filter.
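As a rough illustration of the split-column approach described above, here is a minimal Python sketch using the standard-library sqlite3 module. The table name sales, the columns day_of_week and hour_of_day, the index name and the sample data are all invented for the example, and a real warehouse engine may plan the query differently.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical query: sales on Mondays between 13:00 and 14:00.
MONDAY_13H_SQL = """
    SELECT COUNT(*), SUM(amount)
    FROM sales
    WHERE day_of_week = 0 AND hour_of_day = 13
"""


def build_split_sales_db():
    """Create an in-memory table holding the timestamp plus its split-out parts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sales (
            sold_at     TEXT    NOT NULL,  -- full timestamp
            day_of_week INTEGER NOT NULL,  -- 0 = Monday ... 6 = Sunday
            hour_of_day INTEGER NOT NULL,  -- 0 .. 23
            amount      REAL    NOT NULL
        )
        """
    )
    # Composite index across both derived columns.
    conn.execute(
        "CREATE INDEX idx_sales_dow_hour ON sales (day_of_week, hour_of_day)"
    )
    # Two weeks of made-up data: one sale of 10.0 at the start of every hour.
    start = datetime(2024, 1, 1)  # a Monday
    rows = []
    for i in range(14 * 24):
        ts = start + timedelta(hours=i)
        rows.append((ts.isoformat(sep=" "), ts.weekday(), ts.hour, 10.0))
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


def monday_13h_sales(conn):
    """Return (row count, total amount) for the Monday 13:00-14:00 slot."""
    return conn.execute(MONDAY_13H_SQL).fetchone()


def query_plan(conn):
    """Return SQLite's query plan text so index usage can be inspected."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + MONDAY_13H_SQL).fetchall()
    return "\n".join(row[3] for row in rows)


if __name__ == "__main__":
    conn = build_split_sales_db()
    print(monday_13h_sales(conn))
    print(query_plan(conn))
```

With one sale per hour, the Monday 13:00 slot matches 2 of the 336 rows, which is the 1/168 share mentioned above. Real sales data is rarely spread that evenly, so the share actually read will differ.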

海夕 2024-10-27 16:36:02

It is useful, in many star schemas, to have a time dimension. In that dimension table, it can be useful to have the day of the week, the month, and so on explicitly laid out. Many of these attributes can be accessed through built-in functions in your dialect of SQL, and using the functions takes less disk I/O than materializing the data. But composing reports over given time slices becomes much easier if calendar functions just look like data.

Where this can be really helpful is if your enterprise has a peculiar "company calendar" where dates can belong to units called "fiscal quarters" that are not easily mapped onto day-month-year. If you put all the calendar quirks into a single program that generates the time dimension table, it can make the rest of your warehouse code a whole lot cleaner.

As with any dimension table, it's very important to set the granularity right. If you only want one row per day, you can store ten years' worth of dates with just over 3,650 rows, a tiny table by today's standards. In some cases, a "shift" (an 8-hour period) turns out to be the right granularity. It depends on the uses of the data.
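To make the idea of a single generator program concrete, here is a small Python sketch that builds one row per day. The fiscal calendar is purely an assumption for illustration (a fiscal year starting in April, labeled by the calendar year in which it starts), and the column names are hypothetical; a real company calendar would replace fiscal_quarter with its own rules.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def fiscal_quarter(d, fiscal_year_start_month=4):
    """Return (fiscal_year, quarter) under the assumed fiscal calendar."""
    months_into_fy = (d.month - fiscal_year_start_month) % 12
    quarter = months_into_fy // 3 + 1
    # Assumption: the fiscal year is named after the calendar year it starts in.
    fiscal_year = d.year if d.month >= fiscal_year_start_month else d.year - 1
    return fiscal_year, quarter


def build_date_dimension(first, last):
    """Generate one dimension row per calendar day from first to last inclusive."""
    rows = []
    d = first
    while d <= last:
        fy, fq = fiscal_quarter(d)
        rows.append({
            "date_key": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "day_of_week": DAY_NAMES[d.weekday()],
            "fiscal_year": fy,
            "fiscal_quarter": fq,
        })
        d += timedelta(days=1)
    return rows


if __name__ == "__main__":
    dim = build_date_dimension(date(2020, 1, 1), date(2029, 12, 31))
    print(len(dim))  # 3653: ten years including three leap days
    print(dim[0])
```

Keeping every calendar rule inside one function like fiscal_quarter means reports only join against the generated table and never repeat the date arithmetic.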

No matter which way you go, be prepared for your data to undergo a "metamorphosis" when you set up the warehouse, and for a "trial" when unexpected requirements come up.

无人接听 2024-10-27 16:36:02

A function-based index is one possible option. An indexed view is another.

Just creating a new attribute isn't the thing that improves performance. Any performance difference is due to underlying changes to the way the data is stored and indexed. So it's misleading and overly simplistic to say that creating separate date and time columns is a performance booster. However, creating a separate time column may well be a good idea for other reasons, for example: clarity, simplifying query logic, or taking best advantage of DBMS date/time types and other features.
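The function-based index option can be sketched without adding any new column. The example below uses SQLite through Python's sqlite3 module only because it is easy to run; it assumes a SQLite build recent enough to allow date functions inside index expressions, and other DBMSs use different syntax. The table, index name and sample rows are invented for the example.

```python
import sqlite3

# The WHERE clause repeats the indexed expressions exactly as written in
# CREATE INDEX, which is what lets SQLite consider the expression index.
MONDAY_13H_EXPR_SQL = """
    SELECT COUNT(*), SUM(amount)
    FROM sales
    WHERE CAST(strftime('%w', sold_at) AS INTEGER) = 1
      AND CAST(strftime('%H', sold_at) AS INTEGER) = 13
"""


def build_expr_index_db():
    """Create a table with only the raw timestamp and index derived expressions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (sold_at TEXT NOT NULL, amount REAL NOT NULL)"
    )
    # strftime('%w') yields 0 for Sunday ... 6 for Saturday; '%H' is the hour.
    conn.execute(
        """
        CREATE INDEX idx_sales_dow_hour_expr ON sales (
            CAST(strftime('%w', sold_at) AS INTEGER),
            CAST(strftime('%H', sold_at) AS INTEGER)
        )
        """
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [
            ("2024-01-01 13:30:00", 25.0),  # Monday, 13h
            ("2024-01-01 15:00:00", 40.0),  # Monday, 15h
            ("2024-01-02 13:10:00", 99.0),  # Tuesday, 13h
            ("2024-01-08 13:59:59", 5.0),   # Monday, 13h
        ],
    )
    return conn


def monday_13h_sales_expr(conn):
    """Return (row count, total amount) for Mondays between 13:00 and 14:00."""
    return conn.execute(MONDAY_13H_EXPR_SQL).fetchone()


if __name__ == "__main__":
    conn = build_expr_index_db()
    print(monday_13h_sales_expr(conn))
    for row in conn.execute("EXPLAIN QUERY PLAN " + MONDAY_13H_EXPR_SQL):
        print(row[3])
```

This matches the point made above: the table gains no new attribute, and any speed-up comes from what the index stores, not from how the columns are named.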
