在数据仓库中存储缓慢变化的属性的最佳方法是什么?
在经典的关系数据仓库设计中,缓慢变化的属性(不经常变化的属性)存储在具有类似于以下架构的表中:
EntityKey、StartDate、EndDate、Attribute1、Attribute2、Attribute3...
(这可能与快速变化的属性形成对比,后者可以存储为:
EntityKey、Timestamp、Attribute1、Attribute2、Attribute3...)
我不喜欢这种方法的地方是有很多重复的信息。如果属性 1 每周更改一次,而属性 2 每年仅更改一次,则最终会每周重复属性 2。如果你有很多可以加起来的属性。
当然,您可以为每个时间间隔创建一个这样的表(一个用于每周属性的表,一个用于每月属性的表,一个用于每年属性的表等),但在现实世界中,各种属性会在不同的时间点发生变化,不一定根据任何图案。此外,对于某些实体,相同的属性可能比其他实体更频繁地更改。
我很好奇是否有人对这些不经常更改但频率不同的属性的不同存储模式有建议或想法(即有些每天更改,其他每周更改等)。也许有我不知道的(非关系型)数据库技术更适合此类问题?
In classic relational data warehouse design slow-changing attributes (attributes that change infrequently) are stored in a table with a schema similar to this:
EntityKey, StartDate, EndDate, Attribute1, Attribute2, Attribute3...
(This might be by contrast to fast-changing attributes, which could be stored as:
EntityKey, Timestamp, Attribute1, Attribute2, Attribute3...)
What I don't like about this approach is that there is a lot of repeated information. If Attribute1 changes once a week and Attribute2 only once a year, you end up repeating Attribute2 every week redundantly. If you have a lot of attributes that can add up.
Of course, you could create one such table for each time interval (a table for weekly attributes, one for monthly, one for yearly, etc.) but in the real world various attributes will change at various points in time, not necessarily according to any pattern. Also for some entities the same attribute may change more frequently than for others.
I am curious if anyone has suggestions or ideas for different storage schemas for such attributes that change infrequently but with different frequencies (i.e. some change daily, others weekly, etc.). Perhaps there are (non-relational) database technologies I am not aware of that are more suited for this type of problem?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
这就是仓库的意义所在。重复信息以表示 (a) 发生的历史事实和 (b) 减少连接数量。
错误的。它根本不会加起来很快。
您似乎在谈论星型模式中的维度。它们相对较小。与事实表相比,存储是无关紧要的。不要标准化或优化。将此视为“预连接”、“高速”、“非规范化”、“仅报告”表。对非标准化数据感到满意:它更快。
如果您谈论的是事实表,那么这些更改具有不同的时间粒度,并且从不应该位于同一个事实表中。
That's the point of a warehouse. Repeat the information to represent (a) historical facts as they occurred and (b) reduce the number of joins.
Wrong. It doesn't add up very quickly at all.
You appear to be talking about dimensions in a star schema. They're relatively small. The storage is irrelevant compared to the fact tables. Don't normalize or optimize. Consider this a "pre-joined", "high-speed", "denormalized", "reporting-only" table. Be happy with non-normalized data: it's faster.
If you're talking about a Fact Table, then these changes have different time granularity and never should have been in the same fact table.