代理与自然键:性能差异的硬数据?
代理键和自然键之间存在着一场健康的争论:
我的观点似乎与大多数人(微弱多数)一致,即您应该使用代理键,除非自然键是完全明显的并且保证不会改变。 然后您应该强制自然键的唯一性。 这意味着几乎所有时候都需要代理键。
两种方法的示例,从公司表开始:
1:代理键:表有一个 ID 字段,它是 PK(和身份)。 州要求公司名称是唯一的,因此存在唯一约束。
2:自然键:表使用CompanyName和State作为PK——既满足PK又满足唯一性。
假设 Company PK 用于其他 10 个表。 我的假设(没有数字支持)是代理键方法在这里会快得多。
我见过的关于自然键的唯一令人信服的论据是对于使用两个外键作为自然键的多对多表。 我认为在这种情况下这是有道理的。 但如果您需要重构,您可能会遇到麻烦; 我认为这超出了本文的范围。
有没有人看过一篇文章,比较使用代理键的一组表与使用的同一组表的性能差异自然键? 环顾 SO 和 Google 并没有产生任何有价值的东西,只是大量的理论构建。
重要更新:我已经开始构建一组测试表来回答这个问题。 它看起来像这样:
- PartNatural - 使用的零件表 作为 PK PartSurrogate 的唯一 PartNumber
- - 零件表 使用ID(int,identity)作为PK并且 在 PartNumber
- Plant 上有一个唯一索引 - ID (int,identity) as PK
- Engineer - ID (int,identity)as PK
每个零件都连接到一个工厂,并且工厂中零件的每个实例都连接到一个工程师。 如果有人对此测试平台有疑问,现在就是时候了。
There's a healthy debate out there between surrogate and natural keys:
My opinion, which seems to be in line with the majority (it's a slim majority), is that you should use surrogate keys unless a natural key is completely obvious and guaranteed not to change. Then you should enforce uniqueness on the natural key. Which means surrogate keys almost all of the time.
Example of the two approaches, starting with a Company table:
1: Surrogate key: Table has an ID field which is the PK (and an identity). Company names are required to be unique by state, so there's a unique constraint there.
2: Natural key: Table uses CompanyName and State as the PK -- satisfies both the PK and uniqueness.
Let's say that the Company PK is used in 10 other tables. My hypothesis, with no numbers to back it up, is that the surrogate key approach would be much faster here.
The only convincing argument I've seen for natural key is for a many to many table that uses the two foreign keys as a natural key. I think in that case it makes sense. But you can get into trouble if you need to refactor; that's out of scope of this post I think.
Has anyone seen an article that compares performance differences on a set of tables that use surrogate keys vs. the same set of tables using natural keys? Looking around on SO and Google hasn't yielded anything worthwhile, just a lot of theorycrafting.
Important Update: I've started building a set of test tables that answer this question. It looks like this:
- PartNatural - parts table that uses
the unique PartNumber as a PK - PartSurrogate - parts table that
uses an ID (int, identity) as PK and
has a unique index on the PartNumber - Plant - ID (int, identity) as PK
- Engineer - ID (int, identity) as PK
Every part is joined to a plant and every instance of a part at a plant is joined to an engineer. If anyone has an issue with this testbed, now's the time.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
两者都用! 自然键可以防止数据库损坏(“不一致”可能是一个更好的词)。 当“正确的”自然键(为了消除重复行)由于长度或涉及的列数而表现不佳时,出于性能目的,也可以添加代理键以用作其他表中的外键,而不是自然键...但是自然键应该保留作为备用键或唯一索引,以防止数据损坏并强制数据库一致性...
大部分的欢呼(在关于这个问题的“辩论”中)可能是由于什么是一个错误的假设 - 您必须在其他表中使用主键进行连接和外键。 这是错误的。 您可以使用任何键作为其他表中外键的目标。 它可以是主键、备用键或任何唯一索引或唯一约束,只要它在目标关系(表)中是唯一的。 至于连接,您可以使用任何东西作为连接条件,它甚至不必是键、索引,甚至不必是唯一的! (尽管如果它不是唯一的,您将在它创建的笛卡尔积中获得多行)。 您甚至可以使用非特定条件(如 >、< 或“like”)作为连接条件来创建连接。
事实上,您可以使用任何计算结果为布尔值的有效 SQL 表达式来创建连接。
Use both! Natural Keys prevent database corruption (inconsistency might be a better word). When the "right" natural key, (to eliminate duplicate rows) would perform badly because of length, or number of columns involved, for performance purposes, a surrogate key can be added as well to be used as foreign keys in other tables instead of the natural key... But the natural key should remain as an alternate key or unique index to prevent data corruption and enforce database consistency...
Much of the hoohah (in the "debate" on this issue), may be due to what is a false assumption - that you have to use the Primary Key for joins and Foreign Keys in other tables. THIS IS FALSE. You can use ANY key as the target for foreign keys in other tables. It can be the Primary Key, an alternate Key, or any unique index or unique constraint., as long as it is unique in the target relation (table). And as for joins, you can use anything at all for a join condition, it doesn't even have to be a key, or an index, or even unique !! (although if it is not unique you will get multiple rows in the Cartesian product it creates). You can even create a join using non-specific criterion (like >, <, or "like" as the join condition.
Indeed, you can create a join using any valid SQL expression that evaluate to a boolean.
自然键与代理键的不同之处在于值,而不是类型。
任何类型都可以用作代理键,例如用于系统生成的
slug
的VARCHAR
或其他类型。但是,代理键最常用的类型是 INTEGER 和 RAW(16) (或者您的 RDBMS 用作 GUID 的任何类型) /code>'s),
比较代理整数和自然整数(如
SSN
)所需的时间完全相同。比较 VARCHAR 会考虑排序规则,它们通常比整数长,这使得它们效率较低。
比较一组两个
INTEGER
的效率也可能低于比较单个INTEGER
的效率。对于较小的数据类型,这种差异可能是获取页面、遍历索引、获取数据库锁存器等所需时间的百分比。
以下是数字(在
MySQL
中) ):t_source
只是一个包含1,000,000
行的虚拟表。aint
和adouble
、bint
和bdouble
包含完全相同的数据,除了aint
有一个整数作为PRIMARY KEY
,而adouble
有一对两个相同的整数。在我的机器上,两个查询都运行 14.5 秒,+/- 0.1 秒。
性能差异(如果有)在波动范围内。
Natural keys differ from surrogate keys in value, not type.
Any type can be used for a surrogate key, like a
VARCHAR
for the system-generatedslug
or something else.However, most used types for surrogate keys are
INTEGER
andRAW(16)
(or whatever type yourRDBMS
does use forGUID
's),Comparing surrogate integers and natural integers (like
SSN
) takes exactly same time.Comparing
VARCHAR
s make take collation into account and they are generally longer than integers, that making them less efficient.Comparing a set of two
INTEGER
is probably also less efficient than comparing a singleINTEGER
.On datatypes small in size this difference is probably percents of percents of the time required to fetch pages, traverse indexes, acquite database latches etc.
And here are the numbers (in
MySQL
):t_source
is just a dummy table with1,000,000
rows.aint
andadouble
,bint
andbdouble
contain exactly same data, except thataint
has an integer as aPRIMARY KEY
, whileadouble
has a pair of two identical integers.On my machine, both queries run for 14.5 seconds, +/- 0.1 second
Performance difference, if any, is within the fluctuations range.