Pros and cons of making database IDs consistent and "readable"
Question
Is it a good rule of thumb for database IDs to be "meaningless?" Conversely, are there significant benefits from having IDs structured in a way where they can be recognized at a glance? What are the pros and cons?
Background
I just had a debate with my coworkers about the consistency of the IDs in our database. We have a data-driven application that leverages Spring, so we rarely have to change code. That means that if there's a problem, a data change is usually the solution.
My argument was that by making IDs consistent and readable, we save ourselves significant time and headaches in the long term. Once the IDs are set, they don't have to change often, and if done right, future changes won't be difficult. My coworkers' position was that IDs should never matter. Encoding information into the ID violates DB design policies, and keeping them orderly requires extra work that "we don't have time for." I can't find anything online to support either position. So I'm turning to all the gurus here at SA!
Example
Imagine this simplified list of database records representing food in a grocery store, the first set represents data that has meaning encoded in the IDs, while the second does not:
IDs with meaning:
Type
1 Fruit
2 Veggie
Product
101 Apple
102 Banana
103 Orange
201 Lettuce
202 Onion
203 Carrot
Location
41 Aisle four top shelf
42 Aisle four bottom shelf
51 Aisle five top shelf
52 Aisle five bottom shelf
ProductLocation
10141 Apple on aisle four top shelf
10241 Banana on aisle four top shelf
//just by reading the IDs, it's easy to recognize that these are both Fruit on Aisle 4
IDs without meaning:
Type
1 Fruit
2 Veggie
Product
1 Apple
2 Banana
3 Orange
4 Lettuce
5 Onion
6 Carrot
Location
1 Aisle four top shelf
2 Aisle four bottom shelf
3 Aisle five top shelf
4 Aisle five bottom shelf
ProductLocation
1 Apple on aisle four top shelf
2 Banana on aisle four top shelf
//given the IDs, it's harder to see that these are both fruit on aisle 4
Summary
What are the pros and cons of keeping IDs readable and consistent? Which approach do you generally prefer and why? Is there an accepted industry best-practice?
-------- edit ( helpful background info from comments, below ): --------
In our tables, the Primary Key is always an ID field containing a unique integer. At first, that integer was arbitrary. Over time, some of these IDs naturally took on meaning among developers/testers. During a recent refactor, certain developers also took time to make all IDs easier to recognize. It made everyone's job 100X easier. Some people (who don't actually use the data/code) vehemently disagreed for theoretical reasons. In practice, not one of those objections is holding true. Moreover, all developers using the data agree that it's now significantly easier to maintain.
I'm looking for (but haven't seen) a defensible argument against using immediately recognizable IDs in a data-centric environment.
Comments (11)
Con: I just changed "Aisle Five top shelf" to "Aisle Six top shelf" so now I have to change its ID to 61 and now I have to change the ProductLocation ID for "Grapes on Aisle five top shelf" to be 10461 and oh god where else does the shelf location ID string show up in IDs in my database oh god whoever designed IDs to carry meaning ought to be shot it's four in the morning and everything's gone crazy and why does "Aisle seven bottom shelf" have an ID of 41 die die die.
There are several problems with using database IDs to encode information about a row. If you want your carrots to have an "ID" of 203, you should add a product_id column (for example) and put this information there instead. Why? The only required purpose of an ID is to uniquely identify a row within a table. If it can provide good lookup performance, that's a bonus, and if it can be compactly stored, that's another bonus. But it shouldn't contain any information about the entity in the row it identifies, other than the unique identifier of that entity.
Well, given your 10141 "Apple is in aisle four", what happens when you end up with product 10 in aisle 1 on shelf 41? Or is that product 1 in aisle 014 on shelf 1, or is it product 101 in aisle 41, sitting on the floor because it's not on a shelf?
Once you start co-mingling data like that, you generally lose any ability to reliably extract the components. Human-readable keys are all nice and dandy, but you never destroy the individual IDs the human form is based on.
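The ambiguity described above can be made concrete with a small sketch. The helper name `splits` is hypothetical and exists only for illustration; it enumerates every way a packed ID string could be cut back into its components when no field widths are recorded anywhere.

```python
from itertools import combinations


def splits(packed: str, parts: int) -> list[tuple[str, ...]]:
    """Every way to cut `packed` into `parts` non-empty, contiguous pieces."""
    result = []
    # Choose parts-1 cut positions strictly inside the string.
    for cuts in combinations(range(1, len(packed)), parts - 1):
        bounds = (0, *cuts, len(packed))
        result.append(tuple(packed[a:b] for a, b in zip(bounds, bounds[1:])))
    return result


# Reading "10141" as (product, aisle, shelf) versus (product, location):
three_way = splits("10141", 3)
two_way = splits("10141", 2)

for candidate in three_way + two_way:
    print(candidate)
```

Both ("10", "1", "41") and ("1", "014", "1") show up among the three-part readings, and ("101", "41") among the two-part ones, so the packed value alone cannot say which was meant unless a fixed-width convention is enforced somewhere else.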
What do you mean by "readable"? IDs are typically just numbers. And what do you mean by "consistent"? IDs are typically just incrementing numbers; you can't get much more consistent than that. Why waste time and effort trying to encode information into the ID, when the information will already be present explicitly in the database? Who would make use of "orderly" IDs?
Here is my take on surrogate keys (or IDs, if you want to call them that).
Surrogate keys have no business meaning. They are used to uniquely identify the row. But they do more than merely identify the row. They are also the "soul" of the row. It can't be changed or traded. If the surrogate follows the "soul" principle then when you delete the row, a new row will never take the dead row's value. The soul still belongs to the deleted row even after it's dead and gone.
I like my surrogates to be "souls", though that is not required to be a surrogate.
The advantage of a surrogate is that it never needs to change. If 30 other tables have a foreign key to your main table, you don't want to update all 30 when the main table's PK changes. You can still have a CANDIDATE key on that potentially changing value, but since it can change it is not the row's soul.
Surrogate keys are often auto-increment integers. This lends itself PERFECTLY to clustered indexes. Your table joins will be as good as they can possibly be. Natural keys tend to make horrible clustered indexes, as new values are rarely sequential. Integers are small, fixed-length data types for even faster matching.
If your name changes, you're still you. If you burn off your fingerprints, you're still you. God is using a surrogate key, so I think it's OK for us to use them in our databases.
EDIT
After reading your question more carefully, I think you are actually using "meaningless keys" just in the wrong way.
You have value "10141" to represent the apple/location association. That is the combination of 2 surrogates into 1 field. Keep them as separate fields "101" and "41" and make the PK on the combo of those fields. Keeping them separate will make it easier to search, index, table join, etc.
You are right, you don't need yet another surrogate on the mapping table. The combo of 2 surrogates is a surrogate in its own right (though not a soul). Just express the combo in 2 separate columns, not combined into 1 column.
END EDIT
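A minimal sketch of that suggestion, using Python's built-in sqlite3 module so it runs anywhere. The table and column names (product, location, product_location, product_id, location_id) are assumptions made for illustration, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location (id INTEGER PRIMARY KEY, description TEXT NOT NULL);

-- The mapping table keeps the two keys in separate columns and
-- uses the pair as its primary key instead of a packed value like 10141.
CREATE TABLE product_location (
    product_id  INTEGER NOT NULL REFERENCES product(id),
    location_id INTEGER NOT NULL REFERENCES location(id),
    PRIMARY KEY (product_id, location_id)
);

INSERT INTO product  VALUES (101, 'Apple'), (102, 'Banana');
INSERT INTO location VALUES (41, 'Aisle four top shelf');
INSERT INTO product_location VALUES (101, 41), (102, 41);
""")

# Searching by one component is a plain column comparison, no digit slicing.
rows = conn.execute(
    "SELECT p.name FROM product_location pl "
    "JOIN product p ON p.id = pl.product_id "
    "WHERE pl.location_id = ? ORDER BY p.name",
    (41,),
).fetchall()

# The composite key still rejects a duplicate product/location pair.
try:
    conn.execute("INSERT INTO product_location VALUES (101, 41)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With the components in their own columns, each can be indexed, joined and constrained on its own, which is the practical gain over a single concatenated field.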
Meaningful ids are not against "db design policies"!
Quite the opposite, it's exactly what real relational databases were about from day one. If your data contains some combination of attributes that is unique from the business point of view, NOT making it an ID will usually break Boyce-Codd normal form and bring the anomalies that go with it.
Unless the information encoded in the ID is redundant with what's in other fields, just use it. If it is redundant, make a multi-column primary key. They are not very handy with ORMs, but in data-driven applications they are a blessing.
ADDENDUM: (after the edit of original question)
In your case, for a data driven application, I would do:
With such setup:
The three principal criteria for key design are Familiarity, Simplicity and Stability. If you use keys that are familiar and simple then users will find them easier to recognise, remember and use. They will be less likely to make mistakes when entering and using key values and data quality and usability will usually be improved.
I suggest you resolve this question by asking your users which type of identifier they find easier to use, or even testing out different schemes with them if it's very important to you. In any case, developers alone shouldn't make that decision. Some organisations have departments or individuals responsible for defining the standard coding schemes to be used.
The ID may be meaningful to you but not necessarily to the computer. The database software is not going to be intelligent enough to pick up on a pattern like that (unless you program it to do so, obviously, which will be more trouble than it's worth), so really all you've done is create a potential problem for yourself down the road when there's a conflict with IDs that you didn't foresee.
I understand the point that you're trying to make but good database design involves making things as easy as possible for the database engine to write and read. You're better off setting indexes and studying your database performance to find areas where you can optimize.
Zooko's Triangle and the concept of petnames may be relevant here.
I thought about making this a comment but I'm afraid it might be too involved.
I think the consensus opinion is that in general IDs shouldn't have meaning. Perhaps if you were to constrain your question more to the specifics of your scenario the opinion would be different?
Based on your comments it sounded like you are doing a data load from a spreadsheet and I assume you are using the meaningful IDs as a way of determining the relationships between different data?
Is there a reason why you can't let the database handle auto-increment IDs but let the users (developers?) define codes? This way you can maintain referential integrity via foreign keys and also be properly normalized. If you really need to have some sort of quick look at the data then you could have a computed column with some sort of naming convention. It might even be more meaningful for your needs?
e.g.
Actually Location can be further normalized to aisles and shelves but you get the idea.
When the data is inserted into the database, IDs are created for each record and the relationships can be determined by the code and the foreign keys can be set to the proper ID. Your application can then deal with the IDs without knowledge of the codes.
So the Product Location would look something like:
If you still need something more descriptive you could just do a join in your SQL to get the codes or create a Computed Column or your app could map the IDs to the codes from a cache.
e.g.
That would be a bit of a performance hit and I still don't really see the point but maybe that helps you?
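As a rough sketch of the approach described in this answer (database-generated IDs plus user-defined codes), here is one way it could look with Python's sqlite3 module. The schema, the `code` column, the code values such as 'FRUIT-APPLE' and 'A4-TOP', and the helper `place` are all assumptions made for illustration; they are not taken from the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE product (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- meaningless, never edited
    code TEXT NOT NULL UNIQUE,               -- readable, may be renamed
    name TEXT NOT NULL
);
CREATE TABLE location (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    code        TEXT NOT NULL UNIQUE,
    description TEXT NOT NULL
);
CREATE TABLE product_location (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id  INTEGER NOT NULL REFERENCES product(id),
    location_id INTEGER NOT NULL REFERENCES location(id),
    UNIQUE (product_id, location_id)
);
""")

conn.executemany(
    "INSERT INTO product (code, name) VALUES (?, ?)",
    [("FRUIT-APPLE", "Apple"), ("FRUIT-BANANA", "Banana")],
)
conn.execute(
    "INSERT INTO location (code, description) VALUES (?, ?)",
    ("A4-TOP", "Aisle four top shelf"),
)


def place(product_code: str, location_code: str) -> None:
    """Resolve codes to generated IDs at load time and store only the IDs."""
    conn.execute(
        "INSERT INTO product_location (product_id, location_id) "
        "SELECT p.id, l.id FROM product p, location l "
        "WHERE p.code = ? AND l.code = ?",
        (product_code, location_code),
    )


place("FRUIT-APPLE", "A4-TOP")
place("FRUIT-BANANA", "A4-TOP")

# Renaming a code touches one row; no foreign key anywhere has to change.
conn.execute("UPDATE location SET code = 'A6-TOP' WHERE code = 'A4-TOP'")

# A join rebuilds a readable label whenever a human needs one.
readable = conn.execute(
    "SELECT p.code || '@' || l.code FROM product_location pl "
    "JOIN product p ON p.id = pl.product_id "
    "JOIN location l ON l.id = pl.location_id "
    "ORDER BY pl.id"
).fetchall()
```

The readable label is derived on demand instead of being stored in the key, so a shelf can be renamed without rewriting any rows that reference it.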
OK, that was way too long. :)
I don't think it makes much difference. I always tend to reseed my IDs when I have the opportunity, but that's just me. I suppose it would be useful to have some order in your IDs if you refer to them in code (enums, for example); other than that I wouldn't worry about it.