Does every table really need an auto-incrementing artificial primary key?

Posted on 2024-09-27 00:54:22

Almost every table in every database I've seen in my 7 years of development experience has an auto-incrementing primary key. Why is this? If I have a table of U.S. states where each state must have a unique name, what's the use of an auto-incrementing primary key? Why not just use the state name as the primary key? Seems to me like an excuse to allow duplicates disguised as unique rows.

This seems plainly obvious to me, but then again, no one else seems to be arriving at and acting on the same logical conclusion as me, so I must assume there's a good chance I'm wrong.

Is there any real, practical reason we need to use auto-incrementing keys?

Comments (9)

花开半夏魅人心 2024-10-04 00:54:22

This question has been asked numerous times on SO and has been the subject of much debate over the years amongst (and between) developers and DBAs.

Let me start by saying that the premise of your question implies that one approach is universally superior to the other ... this is rarely the case in real life. Surrogate keys and natural keys both have their uses and challenges - and it's important to understand what they are. Whichever choice you make in your system, keep in mind there is benefit to consistency - it makes the data model easier to understand and easier to develop queries and applications for. I also want to say that I tend to prefer surrogate keys over natural keys for PKs ... but that doesn't mean that natural keys can't sometimes be useful in that role.

It is important to realize that surrogate and natural keys are NOT mutually exclusive - and in many cases they can complement each other. Keep in mind that a "key" for a database table is simply something that uniquely identifies a record (row). It's entirely possible for a single row to have multiple keys representing the different categories of constraints that make a record unique.

A primary key, on the other hand, is a particular unique key that the database will use to enforce referential integrity and to represent a foreign key in other tables. There can only be a single primary key for any table. The essential quality of a primary key is that it be 100% unique and non-NULL. A desirable quality of a primary key is that it be stable (unchanging). While mutable primary keys are possible, they cause many problems for databases that are better avoided (cascading updates, RI failures, etc.). If you do choose to use a surrogate primary key for your table(s), you should also consider creating unique constraints to reflect the existence of any natural keys.

Surrogate keys are beneficial in cases where:

  1. Natural keys are not stable (values may change over time)
  2. Natural keys are large or unwieldy (multiple columns or long values)
  3. The definition of the natural key can change over time (columns added or removed)

By providing a short, stable, unique value for every row, we can reduce the size of the database, improve its performance, and reduce the volatility of dependent tables which store foreign keys. There's also the benefit of key polymorphism, which I'll get to later.

In some instances, using natural keys to express relationships between tables can be problematic. For instance, imagine you had a PERSON table whose natural key was {LAST_NAME, FIRST_NAME, SSN}. What happens if you have some other table GRANT_PROPOSAL in which you need to store a reference to a Proposer, Reviewer, Approver, and Authorizer? You now need 12 columns (4 roles × 3 key columns) to express this information. You also need to come up with a naming convention of some kind to identify which columns belong to which kind of individual. But what if your PERSON table required 6, or 8, or 24 columns to form a natural key? This rapidly becomes unmanageable. Surrogate keys resolve such problems by divorcing the semantics (meaning) of a key from its use as an identifier.
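To make the column count concrete, here is a minimal sketch of the GRANT_PROPOSAL case with a surrogate key, using Python's built-in `sqlite3` module. The column names (`person_id`, `proposer_id`, `title`, and so on) and the sample people are assumptions for illustration only; they are not taken from any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,      -- surrogate key (hypothetical name)
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL,
    ssn        TEXT NOT NULL,
    UNIQUE (last_name, first_name, ssn)  -- the natural key is still enforced
);
CREATE TABLE grant_proposal (
    grant_proposal_id INTEGER PRIMARY KEY,
    title         TEXT    NOT NULL,
    -- one column per role instead of three per role
    proposer_id   INTEGER NOT NULL REFERENCES person(person_id),
    reviewer_id   INTEGER NOT NULL REFERENCES person(person_id),
    approver_id   INTEGER NOT NULL REFERENCES person(person_id),
    authorizer_id INTEGER NOT NULL REFERENCES person(person_id)
);
""")

# Made-up sample data; the SSNs are placeholders.
people = [
    ("Lovelace", "Ada", "000-00-0001"),
    ("Turing", "Alan", "000-00-0002"),
    ("Hopper", "Grace", "000-00-0003"),
    ("Codd", "Edgar", "000-00-0004"),
]
conn.executemany(
    "INSERT INTO person (last_name, first_name, ssn) VALUES (?, ?, ?)", people
)
conn.execute(
    "INSERT INTO grant_proposal "
    "(title, proposer_id, reviewer_id, approver_id, authorizer_id) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Sample proposal", 1, 2, 3, 4),
)

# Each role is resolved back to a person with an ordinary join.
names = conn.execute("""
    SELECT p.first_name || ' ' || p.last_name,
           r.first_name || ' ' || r.last_name
    FROM grant_proposal g
    JOIN person p ON p.person_id = g.proposer_id
    JOIN person r ON r.person_id = g.reviewer_id
""").fetchone()

fk_columns = conn.execute("PRAGMA foreign_key_list(grant_proposal)").fetchall()
print(names, len(fk_columns))
```

The referencing table carries four foreign key columns rather than twelve, and widening the natural key of `person` later would not change `grant_proposal` at all.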

Let's also take a look at the example you described in your question.

Should the 2-character abbreviation of a state be used as the primary key of that table?

On the surface, it looks like the abbreviation field meets the requirements of a good primary key. It's relatively short, it is easy to propagate as a foreign key, it looks stable. Unfortunately, you don't control the set of abbreviations ... the postal service does. And here's an interesting fact: in 1969 the U.S. postal service changed the abbreviation of Nebraska from NB to NE to minimize confusion with New Brunswick, Canada. The moral of the story is that natural keys are often outside of the control of the database ... and they can change over time. Even when you think they cannot. This problem is even more pronounced for more complicated data like people, or products, etc. As businesses evolve, the definitions for what makes such entities unique can change. And this can create significant problems for data modelers and application developers.
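The effect of the Nebraska rename on dependent rows can be sketched as follows. This is an illustration with Python's `sqlite3` and made-up table names (`state_nat`, `city_nat`, `state_sur`, `city_sur`), not a description of any particular system: one design keys on the abbreviation, the other on a surrogate integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for ON UPDATE CASCADE to fire
conn.executescript("""
-- Natural-key design: child rows store the abbreviation itself.
CREATE TABLE state_nat (
    abbr TEXT NOT NULL PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE city_nat (
    name       TEXT NOT NULL,
    state_abbr TEXT NOT NULL REFERENCES state_nat(abbr) ON UPDATE CASCADE
);

-- Surrogate-key design: child rows store an opaque integer.
CREATE TABLE state_sur (
    state_id INTEGER PRIMARY KEY,
    abbr     TEXT NOT NULL UNIQUE,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE city_sur (
    name     TEXT    NOT NULL,
    state_id INTEGER NOT NULL REFERENCES state_sur(state_id)
);

INSERT INTO state_nat VALUES ('NB', 'Nebraska');
INSERT INTO city_nat  VALUES ('Omaha', 'NB'), ('Lincoln', 'NB');
INSERT INTO state_sur VALUES (1, 'NB', 'Nebraska');
INSERT INTO city_sur  VALUES ('Omaha', 1), ('Lincoln', 1);
""")

# The postal service renames NB to NE.
conn.execute("UPDATE state_nat SET abbr = 'NE' WHERE abbr = 'NB'")
conn.execute("UPDATE state_sur SET abbr = 'NE' WHERE abbr = 'NB'")

# Natural key: every child row had to be rewritten by the cascade.
nat_children = [row[0] for row in conn.execute("SELECT state_abbr FROM city_nat")]
# Surrogate key: child rows are untouched; only the state row changed.
sur_children = [row[0] for row in conn.execute("SELECT state_id FROM city_sur")]
print(nat_children, sur_children)
```

With two cities the cascade is trivial; with millions of referencing rows spread over many tables, the same rename becomes a large write that every dependent table has to absorb.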

Earlier I mentioned that surrogate primary keys can support key polymorphism. What does that mean? Well, polymorphism is the ability of one type, A, to appear as and be used like another type, B. In databases, this concept refers to the ability to combine keys from different classes of entities into a single table. Let's look at an example. Imagine for a moment that you want to have an audit trail in your system that identifies which entities were modified by which user on what date. It would be nice to create a table with the fields: {ENTITY_ID, USER_ID, EDIT_DATE}. Unfortunately, using natural keys, different entities have different keys. So now we need to create a separate linking table for each kind of entity ... and build our application in a manner where it understands the different kinds of entities and how their keys are shaped.
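A rough sketch of such an audit table, again with `sqlite3`. The table names and the helper `record_edit` are hypothetical. One assumption goes beyond the three fields listed above: because each table here numbers its own rows, an `entity_type` column is added so that product 1 and customer 1 can be told apart. A system with globally unique ids would not need it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE app_user (user_id     INTEGER PRIMARY KEY, login TEXT NOT NULL UNIQUE);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, sku   TEXT NOT NULL UNIQUE);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);

-- One audit table for every kind of entity, because every key has the same shape.
CREATE TABLE audit_trail (
    entity_type TEXT    NOT NULL,  -- assumption: added to distinguish per-table ids
    entity_id   INTEGER NOT NULL,
    user_id     INTEGER NOT NULL REFERENCES app_user(user_id),
    edit_date   TEXT    NOT NULL
);

INSERT INTO app_user (login) VALUES ('alice');
INSERT INTO product  (sku)   VALUES ('SKU-001');
INSERT INTO customer (email) VALUES ('bob@example.com');
""")


def record_edit(entity_type, entity_id, user_id, edit_date):
    """Log one edit; the same code path serves every entity type."""
    conn.execute(
        "INSERT INTO audit_trail VALUES (?, ?, ?, ?)",
        (entity_type, entity_id, user_id, edit_date),
    )


record_edit("product", 1, 1, "2024-01-15")
record_edit("customer", 1, 1, "2024-01-16")

audit_rows = conn.execute(
    "SELECT entity_type, entity_id, user_id, edit_date "
    "FROM audit_trail ORDER BY edit_date"
).fetchall()
print(audit_rows)
```

The trade-off in this sketch is that the database cannot declare a foreign key on `entity_id`, since it may point into several tables; that check is left to the application.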

Don't get me wrong. I'm not advocating that surrogate keys should ALWAYS be used. In the real world, "never", "ever", and "always" are dangerous positions to adopt. One of the biggest drawbacks of surrogate keys is that they can result in tables that have foreign keys consisting of lots of "meaningless" numbers. This can make it cumbersome to interpret the meaning of a record, since you have to join or look up records from other tables to get a complete picture. It can also make a distributed database deployment more complicated, as assigning unique incrementing numbers across servers isn't always possible (although most modern databases like Oracle and SQL Server mitigate this via sequence replication).

从来不烧饼 2024-10-04 00:54:22

No.

In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, so can SSNs and other items of information.

In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.

A primary key must be:

  • unique (100% guaranteed! Not just "almost" unique)
  • NOT NULL

A primary key should be:

  • stable if at all possible (not changing - or at least not too frequently)

State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.

So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since naturally occurring data is often not really up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.

墨落成白 2024-10-04 00:54:22

I think the use of the word "Primary" in the phrase "Primary Key" is, in a real sense, misleading.

First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table.

Then, having any key serves several, often mutually inconsistent, purposes.

Purpose 1. To use as the join condition to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables.)
Purpose 2. (Related) To ensure that child records must have a parent record in the parent table. (The child table FK must exist as a Key in the parent table.)
Purpose 3. To increase performance of queries that need to rapidly locate a specific record/row in the table.

Purpose 4. (Most important from a data consistency perspective!) To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted into the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)

Clearly, any non-meaningful, non-natural key (like a GUID or an auto-generated integer) is totally incapable of satisfying Purpose 4.

But with many (most) tables, a totally natural key which can provide #4 will often consist of multiple attributes and be so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequences.

The answer is simple. Use both. Use a simple auto-generated integer key for all joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) has an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a key, to avoid inconsistent duplicate data...

The only time you can get away without both is for a completely standalone table that participates in no relationships with other tables and has an obvious and reliable natural key.
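As a small illustration of the "use both" advice (a sketch with `sqlite3`; the table names `state_id_only` and `state_both` are made up), compare a table that has only a surrogate key with one that also declares the natural key as unique:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Surrogate key only: nothing stops the same state from being entered twice.
CREATE TABLE state_id_only (
    state_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
-- Surrogate key for joins and FKs, plus a unique natural key for consistency.
CREATE TABLE state_both (
    state_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
""")

# Both inserts succeed and receive different ids: a duplicate in disguise.
for _ in range(2):
    conn.execute("INSERT INTO state_id_only (name) VALUES ('Nebraska')")
id_only_count = conn.execute(
    "SELECT COUNT(*) FROM state_id_only WHERE name = 'Nebraska'"
).fetchone()[0]

# With the UNIQUE constraint, the second insert is refused.
conn.execute("INSERT INTO state_both (name) VALUES ('Nebraska')")
try:
    conn.execute("INSERT INTO state_both (name) VALUES ('Nebraska')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(id_only_count, duplicate_rejected)
```

The first table is exactly the situation the question worries about; the second keeps the integer for Purposes 1 to 3 and lets the unique constraint handle Purpose 4.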

街道布景 2024-10-04 00:54:22

In general, a numeric primary key will perform better than a string. You can additionally create unique keys to prevent duplicates from creeping in. That way you get the assurance of no duplicates, but you also get the performance of numbers (vs. strings in your scenario).

In all likelihood, the major databases have some performance optimizations for integer-based primary keys that are not present for string-based primary keys. But that is only a reasonable guess.

故事灯 2024-10-04 00:54:22

Yes, in my opinion every table needs an auto incrementing integer key because it makes both JOINs and (especially) front-end programming much, much, much easier. Others feel differently, but this is over 20 years of experience speaking.

The single exception is small "code" or "lookup" tables in which I'm willing to substitute a short (4 or 5 character) TEXT code value. I do this because I often use a lot of these in my databases and it allows me to present a meaningful display to the user without having to look up the description in the lookup table or JOIN it into a result set. Your example of a States table would fit in this category.
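A minimal sketch of that lookup-table exception, using `sqlite3` with invented table and column names (`state.code`, `customer.state_code`) and a made-up customer. The short code is the primary key of the lookup table, so the referencing row is readable without a join, while the foreign key still rejects unknown codes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE state (
    code TEXT NOT NULL PRIMARY KEY CHECK (length(code) = 2),  -- short natural key
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    state_code  TEXT NOT NULL REFERENCES state(code)
);
INSERT INTO state VALUES ('NE', 'Nebraska'), ('TX', 'Texas');
INSERT INTO customer (name, state_code) VALUES ('Acme Feed Co.', 'NE');
""")

# No join needed: the stored value is already meaningful to a user.
rows = conn.execute("SELECT name, state_code FROM customer").fetchall()

# Referential integrity is still enforced on the code column.
try:
    conn.execute("INSERT INTO customer (name, state_code) VALUES ('Nowhere Inc.', 'ZZ')")
    unknown_code_rejected = False
except sqlite3.IntegrityError:
    unknown_code_rejected = True

print(rows, unknown_code_rejected)
```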

情愿 2024-10-04 00:54:22

No, absolutely not.

Having a primary key which can't change is a good idea (UPDATE is legal for primary key columns, but in general potentially confusing and can create problems for child rows). But if your application has some other candidate which is more suitable than an auto-incrementing value, then you should probably use that instead.

Performance-wise, in general fewer columns are better, and particularly fewer indexes. If you have another column which has a unique index on it AND can never be changed by any business process, then it may be a suitable primary key.

Speaking from a MySQL (InnoDB) perspective, it's also a good idea to use a "real" column as a primary key rather than an "artificial" one, as InnoDB always clusters on the primary key and includes it in secondary indexes (that is how it finds the rows they point to). This gives it the potential to do useful optimisation with a primary key which it can't do with any other unique index. MSSQL users often choose to cluster on the primary key, but it can also cluster on a different unique index.

EDIT:

But if it's a small database and you don't really care about performance or size too much, adding an unnecessary auto-increment column isn't that bad.

A non auto-incrementing value (e.g. UUID, or some other string generated according to your own algorithm) may be useful for distributed, sharded, or diverse systems where maintaining a consistent auto-incrementing ID is difficult (or impossible - think of a distributed system which continues to insert rows on both sides of a network partition).
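To illustrate the distributed case, here is a sketch in which two in-memory SQLite databases stand in for nodes on either side of a partition. The table name `event` and the helper `new_row_id` are hypothetical. Each node generates random UUIDs with Python's `uuid.uuid4()` and needs no coordination with the other.

```python
import sqlite3
import uuid

SCHEMA = "CREATE TABLE event (event_id TEXT NOT NULL PRIMARY KEY, payload TEXT NOT NULL)"


def new_row_id() -> str:
    """Generate a key locally; no shared counter or network round trip."""
    return str(uuid.uuid4())


# Two independent "nodes" that cannot see each other.
node_a = sqlite3.connect(":memory:")
node_b = sqlite3.connect(":memory:")
for node in (node_a, node_b):
    node.execute(SCHEMA)

for i in range(3):
    node_a.execute("INSERT INTO event VALUES (?, ?)", (new_row_id(), f"a-{i}"))
    node_b.execute("INSERT INTO event VALUES (?, ?)", (new_row_id(), f"b-{i}"))

# After the partition heals, rows from both sides are merged into one table.
merged = sqlite3.connect(":memory:")
merged.execute(SCHEMA)
for node in (node_a, node_b):
    merged.executemany(
        "INSERT INTO event VALUES (?, ?)",
        node.execute("SELECT event_id, payload FROM event").fetchall(),
    )

merged_count = merged.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(merged_count)
```

Had both nodes used a local auto-increment column, each would have issued ids 1, 2, 3 and the merge would have hit primary key conflicts. With random UUIDs a collision is possible in principle but vanishingly unlikely.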

自由如风 2024-10-04 00:54:22

I think there are two things that may explain the reason why auto-incrementing keys are sometimes used:

  • Space consideration: OK, your state name doesn't amount to much, but the space it takes may add up. If you really want to store the state with its name as a primary key, then go ahead, but it will take more space. That may not be a problem in certain cases, and it sounds like a problem of olden days, but the habit is perhaps ingrained. And we programmers and DBAs do love habits :D

  • Defensive consideration: I recently had the following problem. We have users in the database where the email is the key to all identification. Why not make the email the primary key? Except suddenly edge cases creep in where one guy must be there twice to have two different addresses, and nobody talked about it in the specs so the address is not normalized, and there's this situation where two different emails must point to the same person and... After a while, you stop pulling your hair out and add the damn integer id column.

I'm not saying it's a bad habit, nor a good one; I'm sure good systems can be designed around reasonable primary keys, but these two points lead me to believe fear and habit are two among the culprits.

千纸鹤带着心事 2024-10-04 00:54:22

It's a key component of relational databases. Having an integer relate to a state instead of having the whole state name saves a bunch of space in your database! Imagine you have a million records referencing your state table. Do you want to use 4 bytes for a number on each of those records or do you want to use a whole crapload of bytes for each state name?
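A back-of-the-envelope version of that comparison, as a sketch: it assumes a 4-byte integer key and uses four state names as an arbitrary sample. Real engines add per-value overhead and may store integers or strings differently, so treat the numbers as rough payload sizes only.

```python
# Sample of state names; an arbitrary illustration, not the full list.
sample_names = ["Ohio", "Nebraska", "South Carolina", "Massachusetts"]

referencing_rows = 1_000_000
int_key_bytes = 4  # assumption: fixed-width 32-bit integer key

# Average payload of a name stored as UTF-8 text.
avg_name_bytes = sum(len(n.encode("utf-8")) for n in sample_names) / len(sample_names)

int_total = referencing_rows * int_key_bytes
name_total = referencing_rows * avg_name_bytes

print(f"integer keys: {int_total / 1e6:.2f} MB")
print(f"state names:  {name_total / 1e6:.2f} MB")
```

For this sample the names average 9.75 bytes, so a million references come to roughly 9.75 MB of key data against 4 MB for integers. A two-letter code, at 2 bytes, would be smaller than either.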

贪了杯 2024-10-04 00:54:22

Here are some practical considerations.

Most modern ORMs (Rails, Django, Hibernate, etc.) work best when there is a single integer column as the primary key.

Additionally, having a standard naming convention (e.g. id as primary key and table_name_id for foreign keys) makes identifying keys easier.
