What is the best practice for representing time intervals in a data warehouse?

Published 2024-10-04 07:24:47

In particular I am dealing with a Type 2 Slowly Changing Dimension and need to represent the time interval a particular record was active for, i.e. for each record I have a StartDate and an EndDate. My question is whether to use a closed ([StartDate, EndDate]) or half-open ([StartDate, EndDate)) interval to represent this, i.e. whether or not to include the last date in the interval. To take a concrete example, say record 1 was active from day 1 to day 5, and from day 6 onwards record 2 became active. Do I make the EndDate for record 1 equal to 5 or 6?

Recently I have come around to the way of thinking that half-open intervals are best, based on, inter alia, Dijkstra's Why numbering should start at zero as well as the conventions for array slicing and the range() function in Python. Applying this in the data warehousing context, I see the advantages of a half-open interval convention as the following:

  • EndDate-StartDate gives the time the record was active
  • Validation: the StartDate of the next record will equal the EndDate of the previous record, which is easy to validate (see the sketch after this list).
  • Future Proofing: if I later decide to change my granularity from daily to something shorter then the switchover date still stays precise. If I use a closed interval and store the EndDate with a timestamp of midnight then I would have to adjust these records to accommodate this.
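
To illustrate the validation point, a gap-and-overlap check under the half-open convention might look like the sketch below; dimCustomer and CustomerID are illustrative names, and lead() assumes a DBMS with window-function support.

-- Sketch: with half-open intervals, each row's EndDate must equal the
-- StartDate of the next version of the same business key.
-- Any row returned here indicates a gap or an overlap.
select CustomerID, StartDate, EndDate, NextStartDate
from (
    select CustomerID,
           StartDate,
           EndDate,
           lead(StartDate) over (partition by CustomerID
                                 order by StartDate) as NextStartDate
    from dimCustomer
) as v
where NextStartDate is not null
  and NextStartDate <> EndDate;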

Therefore my preference would be to use a half-open interval methodology. However, if there were some widely adopted industry convention of using the closed-interval method then I might be swayed to go with that instead, particularly if it is based on practical experience of implementing such systems rather than my abstract theorising.

3 Answers

∞觅青森が 2024-10-11 07:24:47

I have seen both closed and half-open versions in use. I prefer half-open for the reasons you have stated.

In my opinion the half-open version makes the intended behaviour clearer and is "safer". The predicate (a <= x < b) clearly shows that b is intended to be outside the interval. In contrast, if you use closed intervals and specify (x BETWEEN a AND b) in SQL, then if someone unwisely uses the EndDate of one row as the start of the next, you get the wrong answer.

Make the latest end date default to the largest date your DBMS supports rather than null.
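
As a concrete illustration of both points, a point-in-time lookup under the half-open convention could look like this sketch; dimCustomer, CustomerID and the literal date are illustrative, and '9999-12-31' is just an example of a DBMS maximum date.

-- Half-open lookup: each date belongs to exactly one version, because the
-- EndDate itself is excluded from the interval.
select *
from dimCustomer
where CustomerID = 42
  and StartDate <= '2024-06-15'
  and EndDate   >  '2024-06-15';

-- With closed intervals and BETWEEN, a date equal to one row's EndDate and
-- the next row's StartDate would match both rows:
--   where '2024-06-15' between StartDate and EndDate

-- Giving the current row an EndDate far in the future (e.g. '9999-12-31',
-- the largest date your DBMS supports) keeps the same predicate working
-- without a separate "is null" branch.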

桃扇骨 2024-10-11 07:24:47

Generally I agree with David's answer, so I won't repeat that info. Further to that:

Did you really mean half open ([StartDate,EndDate))?

Even in that "half-open", there are two errors. One is a straight Normalisation error that of course implements duplicate data that you identify in the discussion, that is available as derived data, and that should be removed.

  • To me, Half Open is (StartDate)
  • EndDate is derived from the next row.
  • it is best practice
  • it is not common usage, because (a) common implementors are unaware of it these days and (b) they are too lazy, or don't know how, to code the necessary simple subquery (sketched after this list)
  • it is based on experience, in large banking databases
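
A minimal sketch of that derivation, with only StartDate stored; dimCustomer and CustomerID are illustrative names rather than the model linked below:

-- Only StartDate is stored; EndDate is derived as the StartDate of the
-- next version of the same business key (null for the current version).
select cur.CustomerID,
       cur.StartDate,
       (select min(nxt.StartDate)
        from dimCustomer as nxt
        where nxt.CustomerID = cur.CustomerID
          and nxt.StartDate  > cur.StartDate) as EndDate
from dimCustomer as cur;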

Refer to this for details:

Link to Recent Very Similar Question & Data Model

Responses to Comments

You seem to clearly favour normalised designs with natural, meaningful keys. Is it ever warranted to deviate from this in a reporting data warehouse? My understanding is that the extra space devoted to surrogate keys and duplicate columns (eg EndDate) is a trade-off for increased query performance. However, some of your comments about cache utilisation and increased disk IO make me question this. I would be very interested in your input on this.

  1. Yes, absolutely. Any sane person (who is not learning Computer Science from Wikipedia) should question that. It simply defies the laws of physics.

  2. Can you understand that many people, without understanding Normalisation or databases (you need 5NF), produce Unnormalised slow data heaps, and their famous excuse (written up by "gurus") is "denormalised for performance" ? Now you know that is excreta.

  3. Those same people, without understanding Normalisation or datawarehouses (you need 6NF), (a) create a copy of the database and (b) all manner of weird and wonderful structures to "enhance" queries, including (c) even more duplication. And guess what their excuse is ? "denormalised for performance".

  4. The simple truth (not complex enough for people who justify datawarehouses with (1) (2) (3) ), is that 6NF, executed properly, is the data warehouse. I provide both database and data warehouse from the same data, at warehouse speeds. No second system; no second platform; no copies; no ETL; no keeping copies synchronised; no users having to go to two sources. Sure, it takes skill and an understanding of performance, and a bit of special code to overcome the limitations of SQL (you cannot specify 6NF in DDL, you need to implement a catalogue).

  • Why implement a StarSchema or a SnowFlake, when the pure Normalised structure already has full Dimension-Fact capability?
  • Even if you did not do that: if you just did the traditional thing and ETLed that database onto a separate datawarehouse system and, within it, eliminated duplication, reduced row size and reduced Indices, of course it would run faster. Otherwise it defies the laws of physics: fat people would run faster than thin people; a cow would run faster than a horse.
  • Fair enough, if you don't have a Normalised structure, then anything, please, to help. So they come up with StarSchemas, SnowFlakes and all manner of Dimension-Fact designs.

And please understand, only unqualified, inexperienced people believe all these myths and magic. Educated, experienced people have their hard-earned truths; they do not hire witch doctors. Those "gurus" only validate that the fat person doesn't win the race because of the weather, or the stars; anything but the thing that will solve the problem. A few people get their knickers in a knot because I am direct, I tell the fat person to shed weight; but the real reason they get upset is that I puncture their cherished myths, the ones that keep them justified in being fat. People do not like to change.

  • One thing. Is it ever warranted to deviate? The rules are not black-or-white; they are not single rules in isolation. A thinking person has to consider all of them together, and prioritise them for the context. You will find neither all Id keys nor zero Id keys in my databases, but every Id key has been carefully considered and justified.

  • By all means, use the shortest possible keys, but use meaningful Relational ones over Surrogates; and use Surrogates when the key becomes too large to carry.

  • But never start out with Surrogates. This seriously hampers your ability to understand the data; Normalise; model the data.

    • Here is one question/answer (of many!) where the person was stuck in the process, unable to identify even the basic Entities and Relations, because he had stuck Id keys on everything at the start. Problem solved without discussion, in the first iteration.
  • Ok, another thing. Learn this subject, get experience, and further yourself. But do not try to teach it or convert others, even if the lights went on and you are eager. Especially if you are enthusiastic. Why? Because when you question a witch doctor's advice, the whole village will lynch you because you are attacking their cherished myths, their comfort; and you need my kind of experience to nail witch doctors (just check the comments for evidence of this!). Give it a few years, get your real hard-won experience, and then take them on.

If you are interested, follow this question/answer for a few days, it will be a great example of how to follow IDEF1X methodology, how to expose and distil those Identifiers.

随遇而安 2024-10-11 07:24:47

Well, the standard SQL where my_field between date1 and date2 is inclusive, so I prefer the inclusive (closed) form -- not that the other one is wrong.

The thing is that for usual DW queries, these (rowValidFrom, rowValidTo) fields are mostly not used at all because the foreign key in a fact table already points to the appropriate row in the dimension table.
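
To illustrate, a typical reporting query needs only the surrogate-key join and no date predicate at all; factSales and SalesAmount are illustrative names, not from this answer:

-- Sketch: the fact row was loaded with the surrogate key of the dimension
-- version that was current at the time, so the join alone is enough.
select p.ProductName,
       sum(f.SalesAmount) as TotalSales
from factSales as f
join dimProduct as p
  on p.ProductKey = f.ProductKey
group by p.ProductName;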

These are mostly needed during loading (we are talking Type 2 SCD here), to look up the most current primary key for the matching business key. At that point you have something like:

select ProductKey
from dimProduct
where ProductName = 'unique_name_of_some_product'
  and rowValidTo > current_date ;

Or, if you prefer to create a key pipeline before loading:

insert into keys_dimProduct (ProductName, ProductKey)  -- here ProductName is PK
select ProductName, ProductKey 
from dimProduct
where rowValidTo > current_date ;

This helps loading, because it is easy to cache the key table into memory before loading. For example if ProductName is varchar(40) and ProductKey an integer, the key table is less than 0.5 GB per 10 million rows, easy to cache for lookup.

Other frequently seen variations include where rowIsCurrent = 'yes' and where rowValidTo is null.

In general, one or more of the following fields are used:

  • rowValidFrom
  • rowValidTo
  • rowIsCurrent
  • rowVersion

depending on the DW designer and sometimes on the ETL tool used, because most tools have SCD Type 2 loading blocks.
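
For illustration, a hand-rolled version of such a loading block might look like the following sketch when a changed business key arrives; the @ variables and the '9999-12-31' sentinel are assumptions on my part, consistent with the rowValidTo > current_date lookup shown earlier:

-- 1. Expire the current version of the changed business key.
update dimProduct
set    rowValidTo   = @LoadDate,      -- exclusive end under the half-open convention
       rowIsCurrent = 'no'
where  ProductName  = @ProductName
  and  rowIsCurrent = 'yes';

-- 2. Insert the new version, open-ended into the far future.
insert into dimProduct
       (ProductName, rowValidFrom, rowValidTo, rowIsCurrent, rowVersion)
values (@ProductName, @LoadDate, '9999-12-31', 'yes', @NewVersion);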

There seems to be a concern about the space used by having extra fields -- so I will estimate here the cost of using some extra space in a dimension table, if for no other reason than convenience.

Suppose I use all of the row_ fields.

rowValidFrom date       = 3 bytes
rowValidTo   date       = 3 bytes
rowIsCurrent varchar(3) = 5 bytes
rowVersion   integer    = 4 bytes

This totals 15 bytes. One may argue that this is 9 or even 12 bytes too many -- OK.

For 10 million rows this amounts to 150,000,000 bytes ~ 0.14GB

I looked up prices from a Dell site.

Memory ~ $38/GB
Disk   ~ $80/TB = 0.078 $/GB 

I will assume RAID 5 here (three drives), so the disk price will be 0.078 $/GB * 3 = 0.23 $/GB.

So, for 10 million rows, to store these 4 fields on disk costs 0.23 $/GB * 0.14 GB = 0.032 $. If the whole dimension table is to be cached into memory, the price of these fields would be 38 $/GB * 0.14GB = 5.32 $ per 10 million rows. In comparison, a beer in my local pub costs ~ 7$.

The year is 2010, and I do expect my next laptop to have 16GB memory. Things and (best) practices change with time.

EDIT:

Did some searching: in the last 15 years, the disk capacity of an average computer has increased about 1000 times, and memory about 250 times.
