How to index a table with a Type 2 slowly changing dimension for optimal performance
Suppose you have a table with a Type 2 slowly changing dimension.
Let's describe this table with the following columns:
* [Key]
* [Value1]
* ...
* [ValueN]
* [StartDate]
* [ExpiryDate]
In this example, let's suppose that [StartDate] is effectively the date on which the values for a given [Key] became known to the system. So our primary key would be composed of both [StartDate] and [Key].
When a new set of values arrives for a given [Key], we insert it with [ExpiryDate] set to some pre-defined high sentinel value such as '12/31/9999'. We then set the [ExpiryDate] of the existing "most recent" record for that [Key] to the [StartDate] of the new record. A simple update based on a join.
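As a rough illustration of that expire-then-insert step, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `dim`, the column names, the helper `apply_new_version`, and the ISO-formatted sentinel `9999-12-31` are assumptions made for the example; they are not from the original schema, and a real warehouse would do this set-based with a join rather than one key at a time.

```python
import sqlite3

HIGH_DATE = "9999-12-31"  # assumed sentinel for "still current" (ISO text sorts correctly)

def make_table(conn):
    # Hypothetical stand-in for the table described above, with one value column.
    conn.execute("""
        CREATE TABLE dim (
            dim_key     INTEGER NOT NULL,
            value1      TEXT,
            start_date  TEXT NOT NULL,
            expiry_date TEXT NOT NULL,
            PRIMARY KEY (start_date, dim_key)
        )""")

def apply_new_version(conn, dim_key, value1, start_date):
    # Expire the current record (if any), then insert the new one, in one transaction.
    with conn:
        conn.execute(
            "UPDATE dim SET expiry_date = ? WHERE dim_key = ? AND expiry_date = ?",
            (start_date, dim_key, HIGH_DATE),
        )
        conn.execute(
            "INSERT INTO dim (dim_key, value1, start_date, expiry_date) "
            "VALUES (?, ?, ?, ?)",
            (dim_key, value1, start_date, HIGH_DATE),
        )

conn = sqlite3.connect(":memory:")
make_table(conn)
apply_new_version(conn, 1, "a", "2020-01-01")
apply_new_version(conn, 1, "b", "2021-06-01")

rows = conn.execute(
    "SELECT dim_key, value1, start_date, expiry_date FROM dim ORDER BY start_date"
).fetchall()
for row in rows:
    print(row)
```

After the second call, the first record's expiry date equals the second record's start date, and only the newest record carries the sentinel.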
So if we always wanted to get the most recent records for a given [Key], we know we could create a clustered index that is:
* [ExpiryDate] ASC
* [Key] ASC
Although the keyspace may be very wide (say, a million keys), we can minimize the number of pages read by ordering on [ExpiryDate] first. Since we know the most recent record for a given key will always have an [ExpiryDate] of '12/31/9999', all current records sit together at the end of the index, and we can use that to our advantage.
However... what if we want to get a point-in-time snapshot of every [Key] at a given time? The keys aren't all updated at the same moment, so for any given point in time the window between [StartDate] and [ExpiryDate] varies from key to key. Ordering by either [StartDate] or [ExpiryDate] would therefore never yield a layout in which all the records you're looking for are contiguous. Granted, you can immediately throw out all records in which the [StartDate] is greater than your defined point in time.
In essence, in a typical RDBMS, what indexing strategy affords the best way to minimize the number of reads needed to retrieve the values for all keys at a given point in time? I realize I can at least maximize I/O throughput by partitioning the table by [Key], but this certainly isn't ideal.
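To make the snapshot query concrete, here is a small sketch in Python with `sqlite3`. It assumes half-open validity intervals (a record is valid when `start_date <= as_of < expiry_date`), ISO date strings, and hypothetical table and column names; none of these details are stated in the question.

```python
import sqlite3

HIGH_DATE = "9999-12-31"  # assumed sentinel for the current record

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim (
        dim_key     INTEGER NOT NULL,
        value1      TEXT,
        start_date  TEXT NOT NULL,
        expiry_date TEXT NOT NULL,
        PRIMARY KEY (start_date, dim_key)
    )""")
conn.executemany(
    "INSERT INTO dim VALUES (?, ?, ?, ?)",
    [
        (1, "a", "2020-01-01", "2021-06-01"),  # superseded version of key 1
        (1, "b", "2021-06-01", HIGH_DATE),     # current version of key 1
        (2, "x", "2020-03-01", HIGH_DATE),     # key 2 has never changed
    ],
)

def snapshot(conn, as_of):
    # One row per key that had a valid record at as_of.
    return conn.execute(
        "SELECT dim_key, value1 FROM dim "
        "WHERE start_date <= ? AND expiry_date > ? "
        "ORDER BY dim_key",
        (as_of, as_of),
    ).fetchall()

print(snapshot(conn, "2020-12-31"))
print(snapshot(conn, "2021-06-01"))
```

The two range predicates are exactly why no single sort order makes the qualifying rows contiguous: the rows matching `start_date <= as_of` and the rows matching `expiry_date > as_of` are each contiguous under a different ordering, and the snapshot is their intersection.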
Alternatively, is there a different type of slowly-changing-dimension that solves this problem in a more performant manner?
Comments (2)
Lazy DBA
Are you talking about bringing back all the values in your dimension table? If so, then why not add a non-clustered covering index, so that you're only pulling values out of the index itself rather than from the table? That way you're scanning a B-tree with some attached "covered" values, as opposed to potentially performing a table scan. I can't vouch for relative performance, but it's worth testing for the scenario you're working on.
Cheers
Ozziemedes
http://ozziemedes.blogspot.com/
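The covering-index suggestion above can be sketched as follows, again in Python with `sqlite3` purely for illustration. SQLite has no `INCLUDE` clause, so the sketch appends the value column to the index key instead; on SQL Server the value columns would normally go in an `INCLUDE` list. The index name, table name, and column names are assumptions, and whether the planner actually picks the index should be checked on the real engine and data.

```python
import sqlite3

HIGH_DATE = "9999-12-31"  # assumed sentinel for the current record

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim (
        dim_key     INTEGER NOT NULL,
        value1      TEXT,
        start_date  TEXT NOT NULL,
        expiry_date TEXT NOT NULL,
        PRIMARY KEY (start_date, dim_key)
    )""")
conn.executemany(
    "INSERT INTO dim VALUES (?, ?, ?, ?)",
    [
        (1, "a", "2020-01-01", "2021-06-01"),
        (1, "b", "2021-06-01", HIGH_DATE),
        (2, "x", "2020-03-01", HIGH_DATE),
    ],
)

# Every column the snapshot query touches is in the index, so the query can be
# answered from the index alone without visiting the base table.
conn.execute(
    "CREATE INDEX ix_dim_snapshot "
    "ON dim (start_date, expiry_date, dim_key, value1)"
)

SNAPSHOT_SQL = (
    "SELECT dim_key, value1 FROM dim "
    "WHERE start_date <= ? AND expiry_date > ? "
    "ORDER BY dim_key"
)

as_of = "2020-12-31"
result = conn.execute(SNAPSHOT_SQL, (as_of, as_of)).fetchall()
print(result)

# Inspect the plan rather than assume it; the last field is the plan detail text.
for step in conn.execute("EXPLAIN QUERY PLAN " + SNAPSHOT_SQL, (as_of, as_of)):
    print(step[-1])
```

The index still has to be range-scanned on `start_date` and filtered on `expiry_date`, so this reduces the width of what is read rather than the number of entries examined.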
If this is truly a "slowly changing dimension" table, I would consider a clustered columnstore index. I know this wasn't available when you asked the question, but anyway, you'll find some great documentation here:
"https://msdn.microsoft.com/en-us/library/gg492088.aspx"
and here
"http://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/".
Now, if you want to stick with rowstore indexes and you're inserting the data into the table sequentially, what I've done in the past is leverage an identity field. Your queries would be something like: