数据仓库中的事实表是否需要代理主键?
当我问我们的数据库设计者为什么我们的事实表没有 PK 时,我被告知即使选择了所有列,表中也没有一组可以唯一标识记录的列。 当我建议我们在这种情况下添加一个身份列时,我被告知“我只是在浪费空间,而且不需要它”。
我的感觉是,源系统中的每个表都应该有一个PK,即使它是一个标识列。 鉴于数据仓库 (DW) 是来自其他系统的数据的接收者,如果无法关联各个记录,我如何才能确保 DW 中的数据准确反映源系统中的内容? 如果您有一个失控的加载程序,该程序搞砸了数据并且已经运行了一周,您将如何协调与实时交易源系统的差异,而无需某种唯一的约束进行比较?
When I asked our DB designers why our Fact table do not have a PK, I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected. Whenb I suggested that we an an identity column in that case I was told that "I'd just be wasting space and that it wasn't needed."
My feeling is that every table in the source system should have a PK, even if it is an identity column. Given that the data warehouse (DW) is a recipient of data from other system-how would I otherwise be able to ensure that the data in the DW accurately reflects what is in the source system if there is no way to tie individual records? If you have a runaway load program that screws up data and has run for a week, how would you reconcile the differences with a live transaction source system w/o some sort of unique constraint to compare?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(12)
数据仓库不一定是关系数据存储,尽管您可以选择将其设为关系数据存储,因此关系定义不一定适用。
仅当您想要对需要唯一标识符的数据执行某些操作时才需要主键(例如跟踪其源,但这并不总是必需的或必要的,甚至是可能的); 数据仓库中的数据通常可以不需要主键的方式使用。 具体来说,您可能不需要将行彼此区分开。 最常用于构建聚合值。
时间不是构建数据仓库表的必需维度。
这可能在心理上不舒服,浪费空间是一个微不足道的问题,但你的同事是对的 - PK 没有必要。
A data warehouse is not necessarily a relational data store, although you may choose to make it one, so relational definitions don't necessarily apply.
A primary key is only required if you want to do something with the data that requires a unique identifier (like trace it to a source, but that's not always required or necessary or even possible anyway); and data in a data warehouse can often be used in ways that don't require primary keys. Specifically, you may not need to distinguish rows from each other. Most often for constructing aggregate values.
Time is not a required dimension in constructing data warehouse tables.
It may be psychologically uncomfortable, and wasted space is a trivial issue, but your colleague is correct - PKs aren't necessary.
身份类型列是一个“代理”键,它替换您的“候选”键之一(简单地说)。 如果没有代理键列就无法识别行,则添加代理键列不会添加任何内容。 这需要候选密钥。
An identity type column is a "surrogate" key that replaces one of your "candidate" keys (simply put). Adding a surrogate key columns adds nothing if you can't identify a row without it. Which requires a candidate key.
如果事实表位于星型模式的中心,那么实际上存在一个候选键。 如果将事实表中的所有外键放在一起,即指向维度表中的行的外键,那就是候选键。
将其声明为主键可能没有多大好处。 它唯一能做的就是保护您免受流氓 ETL 过程的侵害。 运营仓库的人员可能已经掌握了 ETL 处理的情况。
就索引和查询速度而言,星型模式与面向 OLTP 的数据库的问题完全不同。 管理仓库的人可能也掌握着这一点。
在设计供 OLTP 使用的数据库时,使用没有主键的表是不明智的。 同样的考虑因素不会延续到仓库中。
If the fact table is at the center of a star schema, then there is in reality a candidate key. If you take all the foreign keys in the fact table together, the ones that point to rows in the dimension tables, that's a candidate key.
It probably would not do much good to declare it as a primary key. The only thing it would do is protect you against a rogue ETL process. The folks who run the warehouse might have the ETL processing well in hand.
As far as indexing and query speed is concerned, that's a whole different issue with star schemas than it is with OLTP oriented databases. The people who run the warehouse may have that in hand as well.
When designing a database for OLTP use, it's unwise to have a table without a primary key. The same considerations don't carry over into warehouses.
您至少应该在事实表上有一个自然键,以便您可以识别行并根据源协调它们或在必要时跟踪更改。
在 SQL Server 上,标识列免费为您提供代理键,而在使用序列的其他系统(例如 Oracle)上,可以相当轻松地添加它。 由于各种不同的原因,代理事实表键可能很有用。 一些可能的应用程序是:
一些工具喜欢在事实表上使用数字键,最好是单调递增的键。 MS SQL Server Analysis Services 就是一个这样的例子,它非常喜欢为用于填充度量值组的事实表提供一个单调递增的数字键。 这对于增量加载尤其需要。
如果事实表之间存在任何关系(例如,熟悉保险的人员所获得的书面保费明细),那么合成键在这里会很有帮助。
如果您的维度与事实表(例如 ICD 代码)存在 M:M 关系,则事实表上的数字键可以简化这一过程。
如果您对交易有任何自加入要求(例如某些交易是对其他交易的更正),那么合成密钥将简化这些交易的处理。
如果您在数据仓库中执行反重述操作(即通过生成反转和重述行来处理事务数据的更改),那么您可以为同一个自然键拥有多个事实表行。
否则,如果您没有任何内容以 1:M 关系连接到事实表,那么合成键可能不会用于任何用途。
You should at least have a natural key on the fact table so you can identify rows and reconcile them against source or track changes where this is necessary.
On SQL Server an identity column gives you a surrogate key for free and on other systems using sequences (e.g. Oracle) it can be added fairly easily. Surrogate fact table keys can be useful for various different reasons. Some possible applications are:
Some tools like to have numeric keys on fact tables, preferably monotonically increasing ones. An example of this is MS SQL Server Analysis Services, which really likes to have a numeric, monotonically increasing key for fact tables used to populate measure groups. This is especially required for incremental loads.
If you have any relationships between fact tables (for example a written - earned premium breakdown for those familiar with Insurance) then a synthetic key is helpful here.
If you have dimensions living in a M:M relationship with a fact table (e.g. ICD codes) then a numeric key on the fact table simplifies this.
If you have any self-join requirements for transactions (e.g. certain transactions being corrections to others) then a synthetic key will simplify working with these.
If you do contra-restate operations within your data warehouse (i.e. handle changes to transactional data by generating reversals and re-stating the row) then you can have multiple fact table rows for the same natural key.
Otherwise, if you won't have anything joining to your fact table in a 1:M relationship then a synthetic key probably won't be used for anything.
我同意你的看法。
“我被告知,即使选择了所有列,表中也没有一组列可以唯一标识一条记录。” - 据我了解,这似乎打破了关系数据库的一些基本原理。
事实由附加值加上维度的外键组成。 时间是一个明显的维度,对于我所知道的每个维度模型来说都是常见的。 如果不出意外的话,包含时间戳的复合键肯定是足够唯一的。
我想知道你们的 DBA 是否对维度建模有很多了解。 这是一种与正常的关系型、事务型风格不同的思维方式。
I would agree with you.
"I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected." - this seems to break something fundamental about relational databases as I understand them.
A fact consists of additive values plus foreign keys to dimensions. Time is an obvious dimension that is common to every dimensional model that I know. If nothing else, a composite key that contains timestamp would certainly be unique enough.
I wonder if your DBAs have much knowledge about dimensional modeling. It's a different way of thinking from the normal relational, transactional style.
你是对的——在某种程度上。 如果没有主键,表就不能满足关系的最低定义。 作为关系的基础是不允许出现重复的行。 数据仓库设计中的表应该是相关的,即使它们不是严格的规范形式。
因此,行中必须有一些列(或一组列)来唯一地标识行。 但它不一定是代理键的标识列。
如果事实表没有一组列可以起到候选键的作用,那么这个数据仓库中就需要更多的维度表,事实表中也需要更多的列。
这个新维度本身可能不是主键; 它可以与事实表中的现有列组合来创建候选键。
You are correct--sort of. Without a primary key, a table does not meet the minimal definition of being relational. It's fundamental to being a relation that it must not permit duplicate rows. Tables in a Data Warehouse design should be relational, even if they're not strictly in normal form.
So there must be some column (or set of columns) in the row that serve to identify rows uniquely. But it doesn't necessarily have to be an identity column for a surrogate key.
If the Fact Table has no set of columns that can serve this role of being a candidate key, then more Dimension Tables are needed in this DW, and more columns are needed in the Fact Table.
This new Dimension alone may not be the primary key; it may be combined with existing columns in the Fact Table to create a candidate key.
我一直认为表应该按照最常见的查询或性能影响因素进行排序,因此表的聚集索引应该与最困难或最常见的查询保持一致。
主键不一定是聚集索引,所以我知道您可能想知道我要做什么,但我更关心的是聚集索引而不是主键(老实说,它们通常相互跟随)。
所以对我来说最初的问题不是“我应该在事实表上有一个代理主键吗?” 但更像是“我应该在事实表上有聚集索引吗?” ,但我仍然认为这里值得一提,以防万一这是人们真正问的问题,尽管措辞错误)
我认为答案是肯定的,你应该有一个(是的,这个网站上还有其他帖子涵盖了这个问题 有时您需要代理键,但我衷心建议代理不是表的聚集索引。 这样做会根据无意义的代理键对表进行排序。 (通常人们会在表中添加代理标识列,并使其成为主键,默认情况下也是聚集索引)
那么在哪些列上创建聚集索引呢? 就我个人而言,我喜欢事实表的日期,为此,您可以添加一些其他维度的 FK 以实现唯一性,但这会增加大小,并且可能不会提供任何好处,因为为了使索引有用,必须引用相关维度(在生成密钥的重要性顺序)。
为了解决这个问题(以及我在这里回答这个问题的原因),我认为您应该添加一个代理项,然后在日期键上创建聚集索引,然后创建代理项(按该顺序)。 我这样做是因为单独的日期不会形成唯一的行,但添加替代项会形成唯一的行。 这使数据按日期排序,这有助于所有其他非聚集索引,并保持聚集索引大小合理。
此外,随着数据的增长,您可能需要对其进行分区,在这种情况下,您将需要一个始终为日期的分区键。 使用日期作为键的主要部分构建聚集索引使这变得更容易。 通过分区,您现在可以使用滑动窗口技术来存档旧数据或加载。
I always think that a table should be ordered by its most common queries or performance hitters, therefore the clustered index of a table should be in line with the most difficult or common query.
The primary key does not have to be a clustered index so I know you might be wondering where I am going with this but my concern is more about the clustered index than the primary key (and let’s be honest, they normally follow each other).
So the initial question for me is not "should I have a surrogate primary key on the fact table?" but more like "should I have a clustered index on the fact table?" I think the answer is yes you should have one (and yes there are other posts on this site covering this question but I still think it’s worth mentioning in here just in case this is the question people are really asking despite wording it wrong)
There are times you want a surrogate key but I would heartedly recommend that the surrogate is NOT the table’s clustered index. Doing so would order the table in line with the meaningless surrogate key. (Often people add a surrogate identity column to a table and make it the primary key and also the clustered index by default)
So what columns to make the clustered index on? Personally I like date for fact tables and to this you might add some other dimension’s FK for uniqueness but this will increase size and possibly not provide any benefit as in order for the index to be useful the relevant dimensions would have to be referenced (in the order of importance that the key was generated with).
To get around this (and the reason I answer this here) I think you SHOULD add a surrogate and then create the clustered index on the date key and followed by the surrogate (in that order). I do this because the date alone is not going to make a unique row but adding the surrogate will. This keeps the data ordered by date which helps all other non-clustered indexes and also keeps the clustered index size reasonable.
Additionally as the data grows, you may want to partition it in which case you will need a partition key which will invariably be date. Building the clustered index with date as the primary part of key makes this easier. With partitioning you can now use sliding window technique to archive old data or in loading.
http://www.ralphkimball.com/html/controversies.html
:
寓言 事实表的主键由所有引用的维度外键组成。
事实:
事实表通常有 10 个或更多外键连接到维度表的主键。 然而,行唯一性通常只需要事实表外键引用的子集。 大多数事实表都有一个主键,该主键由外键的串联/复合子集组成。
http://www.ralphkimball.com/html/controversies.html
Fable:
The primary key of a fact table consists of all the referenced dimension foreign keys.
Fact:
A fact table often has 10 or more foreign keys joining to the dimension tables’ primary keys. However, only a subset of the fact table’s foreign key references is typically needed for row uniqueness. Most fact tables have a primary key that consists of a concatenated/composite subset of the foreign keys.
每行没有唯一的标识符比乍看起来更糟糕。 当然,它是不稳定的,并且很容易无意中删除一些行。
但性能也差很多。 每次您最终要求数据库获取带有
EmployeeType = 'Manager'
的员工行时,您都在进行字符串比较。 标识符更快更好。此外,存储很便宜,在这种情况下,我想对空间的影响将不到四分之一个百分点如果,作为一个数据仓库,您可能正在为 TB 级的数据进行设计。
Not having a unique identifier for each row is even worse than it first seems. Sure, it is precarious and it's easy to inadvertently delete some rows.
But performance is much worse too. Each time you end up asking the database to get you the rows for Employees with
EmployeeType = 'Manager'
you are doing a string comparison. Identifiers are just faster and better.Besides, storage is cheap and in this case I imagine the impact on space will be less than a quarter percentage point if that--as a data warehouse you are probably designing for terabytes of data.
没有主键的数据库表似乎是一个糟糕的设计选择,并且为不同类型的异常留出了很大的空间,即如何删除或更新此类表中的单个记录?
Database table without primary key seems like a poor design choice and making lots of room for different types of anomalies i.e. how would you delete or update single record in such table?
您在这里混淆了两个问题——识别事实表中的唯一记录,以及跟踪从源系统到事实表的记录。
在后一种情况下,源系统中的单个记录很可能具有多个事实表记录。 想象一下源系统记录,它表示资金从一个帐户转移到另一个帐户。 可能有两个事实表记录来表示这一点,一个用于借方帐户,一个用于贷方帐户。 此外,可能有多个事实记录来表示源系统记录在其生命周期的不同点的不同状态。
对于事实表上的主键问题,确实没有“正确”的答案。 您可能需要一些理想/基本的特征(例如,单个记录的身份可以在系统用户之间轻松通信,或者单个记录可以轻松删除或更新)。 然而,对于 Oracle 系统,ROWID 可以很好地实现这一点,只要它偶尔发生变化并不重要。
但实际上,维护单个合成密钥的开销非常小,因此您还是可以这样做。 您可能选择不对其建立索引,因为索引将比列本身消耗更多的资源。
You're conflating two issues here -- identifying a unique record in the fact table, and tracing records from the source system through to the fact table.
In the latter case it's quite possible for a single record in a source system to have multiple fact table records. Imagine a source system record that represents a transfer of funds from one account to another. There might be two fact table records to represent this, one for the debited account and one for the credited account. Furthermore there might be multiple fact records to represent different states of the source system records at different points in it's lifecycle.
For the issue of the primary key on the fact table, there's really not a "correct" answer. There are desirable/essential characteristics that you might want (for example for the identity of a single record to be communicated easily between users of the system, or for a single record to be deleted or updated easily). However for an Oracle system a ROWID might very well do for that as long as it doesn't matter if it occasionally changes.
Really though, there's so little overhead in maintaining a single synthetic key that you might as well do it anyway. You might choose not to index it, as the index is going to be a much larger resource consumer than the column itself.
使用维度代理键的组合作为事实表的主键并不在所有情况下都有效。 考虑存在三个维度 a、b 和 c 的情况。 在大多数设计中,我们通常有一个“未知”维度行,假设我总是为该行分配代理键-1。 我的事实表中可以轻松地有两行具有键 a=n1、b=n2 和 c=-1,即重复键,因为这两行没有获得维度 c 的有效值,因此都解析为未知行。
using the combination of dimension surrogate keys as the primary key of the fact table doesnt work in all cases. Consider the case where there are three dimensions a, b and c. In most designs we usually have a dimension row for the "unknown", assume i always assign this row the surrogate key of -1. I could easily have two rows in my fact table that have keys a=n1, b=n2 and c=-1, ie duplicate keys because the two rows have not got valid values for dimension c and so both resolve to the unknown row.