I do not see it as a preference or a personal choice. Computer Science is, well, a science, and we are programming machinery, not a sensitive child.
Re-inventing the Wheel
Entire books have been written on the subject of Temporal Data in Relational Databases, by giants of the industry. Codd has passed on, but his colleague and co-author C J Date, and recently H Darwen carry on the work of progressing and refining the Relational Model, in The Third Manifesto. The seminal book on the subject is Temporal Data & the Relational Model by C J Date, Hugh Darwen, and Nikos A Lorentzos.
There are many who post opinions and personal choices re CS subjects as if they were choosing ice cream. This is due to not having had any formal training, and thus treating their CS task as if they were the only person on the planet who had come across that problem, and found a solution. Basically they re-invent the wheel from scratch, as if there were no other wheels in existence. A lot of time and effort can be saved by reading technical material (that excludes Wikipedia and MS publications).
Buy a Modern Wheel
Temporal Data is a problem that thousands of data modellers following the RM have worked on, trying to implement good solutions. Some of those solutions are good and others are not. But now we have the work of giants, seriously researched, and with solutions and prescribed treatment provided. As before, these will eventually be implemented in the SQL Standard. PostgreSQL already has a couple of the required functions (the authors are part of TTM).
Therefore we can take those solutions and prescriptions, which will be (a) future-proofed and (b) reliable (unlike the thousands of not-so-good Temporal databases that currently exist), rather than relying on either personal opinion, or popular votes on some web-site. Needless to say, the code will be much easier as well.
Inspect Before Purchase
If you do some googling, beware that there are also really bad "books" available. These are published under the banner of MS and Oracle, by PhDs who spend their lives at the ice cream parlour. Because they did not read and understand the textbooks, they have a shallow understanding of the problem, and invent quite incorrect "solutions". Then they proceed to provide massive solutions, not to Temporal data, but to the massive problems inherent in their "solutions". You will be locked into problems that have already been identified and solved; and into implementing triggers and all sorts of unnecessary code. Anything available free is worth exactly the price you paid for it.
Temporal Data
So I will try to simplify the Temporal problem, and paraphrase the guidance from the textbook, for the scope of your question. Simple rules, taking both Normalisation and Temporal requirements into account, as well as usage that you have not foreseen.
First and foremost, use the correct Datatype for any kind of Temporal column. That means DATETIME or SMALLDATETIME, depending on the resolution and range that you require. Where only the DATE or TIME portion is required, you can use that. This allows you to perform date & time arithmetic using SQL functions, directly in your WHERE clause.
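To make the point about arithmetic in the WHERE clause concrete, here is a minimal sketch using Python's built-in sqlite3 module. Several things in it are my assumptions, not part of the answer: SQLite has no DATETIME or SMALLDATETIME type, so Instants are held as ISO-8601 text (which its date functions understand and which sorts correctly); the Event table, its columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one Instant per row, stored as ISO-8601 text
# because SQLite has no native DATETIME/SMALLDATETIME type.
conn.execute(
    "CREATE TABLE Event ("
    " EventDtm TEXT NOT NULL PRIMARY KEY,"
    " EventType TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO Event VALUES (?, ?)",
    [
        ("2024-03-01 08:00:00", "Reading"),
        ("2024-03-01 12:30:00", "Meal"),
        ("2024-03-02 07:45:00", "Reading"),
    ],
)

# Date arithmetic inline in the WHERE clause:
# every Event in the 24 hours up to a given Instant.
rows = conn.execute(
    """
    SELECT EventDtm, EventType
      FROM Event
     WHERE EventDtm >  datetime(:asof, '-1 day')
       AND EventDtm <= :asof
     ORDER BY EventDtm
    """,
    {"asof": "2024-03-02 08:00:00"},
).fetchall()

for row in rows:
    print(row)
```

On a platform with real DATETIME columns the same predicate would be written with that platform's own date functions; the shape of the query stays the same.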
Second, make sure that you use really clear names for the columns and variables.
There are three types of Temporal Data. It is all about categorising them properly, so that the treatment (planned and unplanned) is easy (which is why yours is a good question, and why I provide a full explanation). The advantage is much simpler SQL using inline Date/Time functions (you do not need the planned Temporal SQL functions). Always store:
Instant as SMALL/DATETIME, eg. UpdatedDtm
Interval as INTEGER, clearly identifying the Unit in the column name, eg. IntervalSec or NumDays
There are some technicians who argue that Interval should be stored in DATETIME, regardless of the component being used, as (eg) seconds or months since midnight 01 Jan 1900, etc. That is fine, but requires more unwieldy (not complex) code both in the initial storage and whenever it is extracted.
Whatever you choose, be consistent.
Period or Duration. This is defined as the time period between two separate Instants. Storage depends on whether the Period is conjunct or disjunct.
For conjunct Periods, as in your Event requirement: use one SMALL/DATETIME for EventDateTime; the end of the Period can be derived from the beginning of the Period of the next row, and EndDateTime should not be stored.
For disjunct Periods, with gaps in between, yes, you need two SMALL/DATETIMEs, eg. a RentedFrom and a RentedTo, if both Instants are in the same row.
Period or Duration across rows merely needs the ending Instant to be stored in some other row. ExerciseStart is the Event.DateTime of the X1 Event row, and ExerciseEnd is the Event.DateTime of the X9 Event row.
Therefore Period or Duration stored as an Interval is simply incorrect, not subject to opinion.
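As a rough illustration of a Period held across rows, the sketch below pairs each starting Instant with the next ending Instant at query time, so no end column and no Interval is stored. It assumes that X1 and X9 are event type codes marking the start and end of Exercise (the answer does not spell that out), that Instants are ISO-8601 text in SQLite, and that the table layout and sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Event (
        EventDtm  TEXT NOT NULL PRIMARY KEY,  -- Instant, ISO-8601 text
        EventType TEXT NOT NULL               -- assumed codes: X1 = start, X9 = end
    );
    INSERT INTO Event VALUES
        ('2024-03-01 07:00:00', 'X1'),
        ('2024-03-01 07:45:00', 'X9'),
        ('2024-03-01 18:00:00', 'X1'),
        ('2024-03-01 18:30:00', 'X9');
    """
)

# Each Exercise Period is derived: the start is the X1 row,
# the end is the earliest X9 row after it.
periods = conn.execute(
    """
    SELECT s.EventDtm AS ExerciseStart,
           (SELECT MIN(e.EventDtm)
              FROM Event e
             WHERE e.EventType = 'X9'
               AND e.EventDtm  > s.EventDtm) AS ExerciseEnd
      FROM Event s
     WHERE s.EventType = 'X1'
     ORDER BY s.EventDtm
    """
).fetchall()

for start, end in periods:
    print(start, "->", end)
```

A disjunct Period held in one row (RentedFrom, RentedTo) needs no such subquery, since both Instants are already in the row.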
Data Duplication
Separately, in a Normalised database, ie. where EndDateTime is not stored (unless disjunct, as per above), storing a datum that can be derived will introduce an Update Anomaly where there was none.
With one EndDateTime, you have one version of the truth in one place; whereas with duplicated data, you have a second version of the fact in another column:
which breaks 1NF
the two facts need to be maintained (updated) together, transactionally, and are at the risk of being out of synch
different queries could yield different results, due to two versions of the truth
All easily avoided by maintaining the science. The return (insignificant increase in speed of single query) is not worth destroying the integrity of the data for.
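The anomaly is easy to reproduce. The sketch below is only an illustration under assumptions of my own: a hypothetical Rental table that stores both Instants and a redundant NumDays column, in SQLite with ISO-8601 text Instants. One fact is updated, the copy is not, and two queries then disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Rental (
        RentedFrom TEXT    NOT NULL PRIMARY KEY,
        RentedTo   TEXT    NOT NULL,
        NumDays    INTEGER NOT NULL  -- redundant: derivable from the two Instants
    );
    INSERT INTO Rental VALUES ('2024-03-01 00:00:00', '2024-03-04 00:00:00', 3);

    -- The rental is extended, but only one of the two "facts" is updated.
    UPDATE Rental SET RentedTo = '2024-03-06 00:00:00';
    """
)

# Compare the stored copy with the value derived from the Instants.
stored, derived = conn.execute(
    """
    SELECT NumDays,
           CAST(ROUND(julianday(RentedTo) - julianday(RentedFrom)) AS INTEGER)
      FROM Rental
    """
).fetchone()

print("stored NumDays :", stored)
print("derived NumDays:", derived)
```

Dropping the NumDays column and deriving it in the query removes the second version of the fact, and with it the need to keep the two in step.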
Response to Comments
could you expand a little bit on the practical difference between conjunct and disjunct and the direct practical effect of these concepts on db design? (as I understand the difference, the exercise and temp-basal in my database are disjunct because they are distinct events separated by whitespace.. whereas basal itself would be conjunct because there's always a value)
Not quite. In your Db (as far as I understand it so far):
All the Events are Instants, not conjunct or disjunct Periods
The exceptions are Exercise and TempBasal, for which the ending Instant is stored, and therefore they have Periods, with whitespace between the Periods; thus they are disjunct.
I think you want to identify more Durations, such as ActiveInsulinPeriod and ActiveCarbPeriod, etc, but so far they only have an Event (Instant) that is causative.
I don't think you have any conjunct Periods (there may well be, but I am hard pressed to identify any). I retract what I said (when they were Readings, they looked conjunct, but we have progressed).
For a simple example of conjunct Periods, that we can work with re practical effect, please refer to this time-series question. The text and perhaps the code may be of value, so I have linked the Q/A, but I particularly want you to look at the Data Model. Ignore the three implementation options, they are irrelevant to this context.
Every Period in that database is Conjunct. A Product is always in some Status. The End-DateTime of any Period is the Start-DateTime of the next row for the Product.
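For readers without access to the linked Data Model, conjunct Periods can be sketched roughly as follows. This is not that model: the ProductStatus table, its columns and its rows are hypothetical, and Instants are ISO-8601 text in SQLite. Only the start of each Period is stored; the end is read from the next row for the same Product, and an "as of" lookup needs nothing more than the latest start at or before the Instant in question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ProductStatus (
        ProductId INTEGER NOT NULL,
        StartDtm  TEXT    NOT NULL,  -- start of the Period; no end is stored
        Status    TEXT    NOT NULL,
        PRIMARY KEY (ProductId, StartDtm)
    );
    INSERT INTO ProductStatus VALUES
        (1, '2024-01-01 00:00:00', 'Ordered'),
        (1, '2024-01-05 09:00:00', 'Shipped'),
        (1, '2024-01-08 14:00:00', 'Delivered');
    """
)

# End of each Period = start of the next row for the same Product
# (NULL for the current, still open, Period).
rows = conn.execute(
    """
    SELECT ps.ProductId, ps.Status, ps.StartDtm,
           (SELECT MIN(nxt.StartDtm)
              FROM ProductStatus nxt
             WHERE nxt.ProductId = ps.ProductId
               AND nxt.StartDtm  > ps.StartDtm) AS EndDtm
      FROM ProductStatus ps
     ORDER BY ps.ProductId, ps.StartDtm
    """
).fetchall()

# Status of the Product as of a given Instant.
(status,) = conn.execute(
    """
    SELECT Status
      FROM ProductStatus
     WHERE ProductId = :pid
       AND StartDtm  = (SELECT MAX(StartDtm)
                          FROM ProductStatus
                         WHERE ProductId = :pid
                           AND StartDtm <= :asof)
    """,
    {"pid": 1, "asof": "2024-01-06 12:00:00"},
).fetchone()

print(rows)
print(status)
```

Because there is no stored end, a new status row can never leave a gap or an overlap with the previous Period.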
It entirely depends on what you want to do with the data. As you say, you can filter by end time if you store that. On the other hand, if you want to find "all events lasting more than an hour" then the duration would be most useful.
Of course, you could always store both if necessary.
The important thing is: do you know how you're going to want to use the data?
EDIT: Just to add a little more meat, depending on the database you're using, you may wish to consider using a view: store only (say) the start time and duration, but have a view which exposes the start time, duration and computed end time. If you need to query against all three columns (whether together or separately) you'll want to check what support your database has for indexing a view column. This has the benefits of convenience and clarity, but without the downside of data redundancy (having to keep the "spare" column in sync with the other two). On the other hand, it's more complicated and requires more support from your database.
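A minimal sketch of the view idea, assuming SQLite and a hypothetical Activity table (the names, the choice of seconds as the unit and the sample rows are mine): only the start time and duration are stored, and the view exposes a computed end time alongside them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Activity (
        StartDtm    TEXT    NOT NULL PRIMARY KEY,  -- ISO-8601 text
        DurationSec INTEGER NOT NULL
    );

    -- The end time is computed, never stored, so it cannot drift out of sync.
    CREATE VIEW ActivityPeriod AS
        SELECT StartDtm,
               DurationSec,
               datetime(StartDtm, '+' || DurationSec || ' seconds') AS EndDtm
          FROM Activity;

    INSERT INTO Activity VALUES
        ('2024-03-01 07:00:00', 2700),
        ('2024-03-01 18:00:00', 5400);
    """
)

# "All events lasting more than an hour": filter on the stored duration.
long_ones = conn.execute(
    "SELECT StartDtm, EndDtm FROM ActivityPeriod WHERE DurationSec > 3600"
).fetchall()

# Filter on the computed end time through the same view.
ended_by_noon = conn.execute(
    "SELECT StartDtm, EndDtm FROM ActivityPeriod"
    " WHERE EndDtm <= '2024-03-01 12:00:00'"
).fetchall()

print(long_ones)
print(ended_by_noon)
```

Whether a filter on the computed column can use an index depends on the database; in this sketch it is evaluated row by row.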
End - Start = Duration.
One could argue you could even use End and Duration, so there really is no difference in any of the combinations.
Except for the triviality that you need the column included to filter on it, so include