SQL Server normalization tactic: varchar vs int Identity
I'm just wondering what the optimal solution is here.
Say I have a normalized database. The primary key of the whole system is a varchar. What I'm wondering is: should I relate this varchar to an int for normalization, or leave it as is? It's simpler to leave it as a varchar, but relating it to an int might be more optimal.
For instance I can have
People
======================
name varchar(10)
DoB DateTime
Height int
Phone_Number
======================
name varchar(10)
number varchar(15)
Or I could have
People
======================
id int Identity
name varchar(10)
DoB DateTime
Height int
Phone_Number
======================
id int
number varchar(15)
Add several other one-to-many relationships of course.
What do you all think? Which is better and why?
Comments (7)
One thing that others don't seem to have mentioned is that joins on int fields tend to perform better than joins on varchar fields.
And I definitely would always use a surrogate key over using names (of people or businesses) because they are never unique over time. In our database, for instance, we have 164 names with over 100 instances of the same name. This clearly shows the dangers of considering using name as a key field.
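The duplicate-name risk is easy to reproduce. The following is only a minimal sketch: it uses Python's built-in `sqlite3` module as a stand-in for SQL Server (where the `id` column would be declared `int IDENTITY`), and the table names `people_by_name` and `people_by_id` are made up for the illustration.

```python
import sqlite3

def build_people_tables(conn):
    """Create one table keyed by name and one keyed by a generated integer."""
    conn.executescript("""
        CREATE TABLE people_by_name (
            name TEXT PRIMARY KEY,
            dob  TEXT
        );
        CREATE TABLE people_by_id (
            id   INTEGER PRIMARY KEY,  -- auto-assigned, similar to an identity column
            name TEXT NOT NULL,
            dob  TEXT
        );
    """)

def insert_same_name_twice(conn, table):
    """Return True if two different people with the same name both fit in the table."""
    try:
        for dob in ("1980-01-01", "1992-06-15"):
            conn.execute(
                f"INSERT INTO {table} (name, dob) VALUES (?, ?)",
                ("John Smith", dob),
            )
        return True
    except sqlite3.IntegrityError:
        # The second John Smith violates the primary key on name.
        return False

conn = sqlite3.connect(":memory:")
build_people_tables(conn)
name_key_ok = insert_same_name_twice(conn, "people_by_name")
id_key_ok = insert_same_name_twice(conn, "people_by_id")
print(name_key_ok, id_key_ok)  # False True
```

With the name as the key, the second person simply cannot be stored; with the generated key, both rows coexist and stay distinguishable.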
The original question is not one of normalization. If you have a normalized database, as you stated, then you do not need to change it for normalization reasons.
There are really two issues in your question. The first is whether ints or varchars are preferable for use as primary keys and foreign keys. The second is whether you can use the natural keys given in the problem definition, or whether you should generate a synthetic key (surrogate key) to take the place of the natural key.
ints are a little more concise than varchars, and a little more efficient for such things as index processing. But the difference is not overwhelming. You should probably not make your decision on this basis alone.
The question of whether the natural key provided really works as a natural key or not is much more significant. The problem of duplicates in a "name" column is not the only problem. There is also the problem of what happens when a person changes her name. This problem probably doesn't surface in the example you've given, but it does surface in lots of other database applications. An example would be the transcript over four years of all the courses taken by a student. A woman might get married and change her name in the course of four years, and now you're stuck.
You either have to leave the name unchanged, in which case it no longer agrees with the real world, or update it retroactively in all the courses the person took, which makes the database disagree with the printed rosters made at the time.
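As a rough illustration of why a surrogate key takes the pressure off the key itself, here is a sketch using Python's `sqlite3` module in place of SQL Server. The `student` and `course_result` tables and the sample names are hypothetical. It only shows that a name change no longer touches the referencing rows; whether old rosters should keep showing the old name is a separate design decision that this sketch does not settle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        id   INTEGER PRIMARY KEY,   -- surrogate key, independent of the name
        name TEXT NOT NULL
    );
    CREATE TABLE course_result (
        student_id INTEGER NOT NULL REFERENCES student(id),
        course     TEXT NOT NULL,
        grade      TEXT NOT NULL
    );
    INSERT INTO student (id, name) VALUES (1, 'Jane Doe');
    INSERT INTO course_result VALUES (1, 'Databases 101', 'A');
    INSERT INTO course_result VALUES (1, 'Algorithms', 'B');
""")

# The name change is a single-row update; no course_result row is rewritten.
conn.execute("UPDATE student SET name = ? WHERE id = ?", ("Jane Roe", 1))

transcript = conn.execute("""
    SELECT s.name, r.course, r.grade
    FROM student AS s
    JOIN course_result AS r ON r.student_id = s.id
    ORDER BY r.course
""").fetchall()
print(transcript)
# [('Jane Roe', 'Algorithms', 'B'), ('Jane Roe', 'Databases 101', 'A')]
```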
If you do decide on a synthetic key, you now have to decide whether or not the application is going to reveal the value of the synthetic key to the user community. That's another whole can of worms, and beyond the scope of this discussion.
If the "name" field really is appropriate as a primary key, then do it. The database will not get more normalized by creating a surrogate key in that case. You will get some duplicate strings for foreign keys, but that is not a normalization issue, since the FK constraint guarantees integrity on strings just as it would on surrogate keys.
However, you are not explaining what the "name" is. In practice it is very seldom that a string is appropriate as a primary key. If it is the name of a person, it won't work as a PK, since more than one person can have the same name, people can change names, and so on.
I think if your VARCHAR was larger you would notice you're duplicating quite a bit of data throughout the database. Whereas if you went with a numeric ID column, you're not duplicating nearly the same amount of data when adding foreign key columns to other tables.
Moreover, textual data is a royal pain in terms of comparisons; your life is much easier when you're doing WHERE id = user_id versus WHERE name LIKE inputname (or something similar).
Can you really use names as primary keys? Isn't there a high risk of several people with the same name?
If you really are so lucky that your name attribute can be used as primary key, then - by all means - use that. Often, though, you will have to make something up, like a customer_id, etc.
And finally: "NAME" is a reserved word in at least one DBMS, so consider using something else, e.g. fullname.
Using any kind of non-synthetic data (i.e. anything from the user, as opposed to generated by the application) as a PK is problematic: you have to worry about culture/localization differences and case sensitivity (and other issues depending on DB collation), and you can end up with data problems if/when that user-entered data ever changes.
Using non-user-generated data is much easier and much safer: sequential GUIDs (or non-sequential ones if your DB doesn't support them or you don't care about page splits), or identity ints if you don't need GUIDs.
Regarding duplicate data: I don't see how using non-synthetic keys protects you from that. You still have issues where the user enters "Bob Smith" instead of "Bob K. Smith" or "Smith, Bob" or "bob smith" etc. Duplication management is necessary (and pretty much identical) regardless of whether your key is synthetic or non-synthetic, and non-synthetic keys have a host of other potential issues that synthetic keys neatly avoid.
Many projects don't need to worry about that (tightly constrained collation choices avoid many of these issues, for example), but in general I prefer synthetic keys. This is not to say you can't be successful with organic keys (clearly you can), but for many projects they're not the better choice.
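The collation point can be shown with a small experiment. This is a sketch only: it relies on SQLite's `BINARY` and `NOCASE` collations through Python's `sqlite3` module, which are not the collations SQL Server offers, and the helper name `accepts_both_spellings` is invented for the example. The effect on a text primary key is the same in kind: whether two spellings count as one key depends on the collation, not on the data.

```python
import sqlite3

def accepts_both_spellings(collation):
    """Return True if 'Bob Smith' and 'bob smith' can coexist as primary keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE people (name TEXT COLLATE {collation} PRIMARY KEY)"
    )
    try:
        conn.execute("INSERT INTO people VALUES ('Bob Smith')")
        conn.execute("INSERT INTO people VALUES ('bob smith')")
        return True
    except sqlite3.IntegrityError:
        # Under this collation the two spellings are the same key.
        return False
    finally:
        conn.close()

print(accepts_both_spellings("BINARY"))  # True: case-sensitive comparison
print(accepts_both_spellings("NOCASE"))  # False: ASCII case-insensitive comparison
```

Either outcome can be a problem for a user-entered key: one collation silently stores the same person twice, the other rejects an entry the user thought was new.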
I believe that the majority of people who have developed any significantly sized real-world database applications will tell you that surrogate keys are the only realistic solution.
I know the academic community will disagree, but that is the difference between theoretical purity and practicality.
Any reasonably sized query that has to join tables that use non-surrogate keys, where some tables have composite primary keys, quickly becomes unmaintainable.
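To make the join comparison concrete, here is a small sketch, again using Python's `sqlite3` module rather than SQL Server. The `enrollment_*` and `grade_*` tables are hypothetical. Both queries return the same rows; the difference is how many key columns every child table and every join condition has to carry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Natural, composite key: every child table repeats all key columns.
    CREATE TABLE enrollment_nat (
        student_name TEXT, course_code TEXT, term TEXT,
        PRIMARY KEY (student_name, course_code, term)
    );
    CREATE TABLE grade_nat (
        student_name TEXT, course_code TEXT, term TEXT, grade TEXT,
        FOREIGN KEY (student_name, course_code, term)
            REFERENCES enrollment_nat (student_name, course_code, term)
    );
    -- Surrogate key: the child table carries a single integer.
    CREATE TABLE enrollment_sur (
        id INTEGER PRIMARY KEY,
        student_name TEXT, course_code TEXT, term TEXT,
        UNIQUE (student_name, course_code, term)
    );
    CREATE TABLE grade_sur (
        enrollment_id INTEGER REFERENCES enrollment_sur (id),
        grade TEXT
    );
    INSERT INTO enrollment_nat VALUES ('Jane Doe', 'DB101', '2024S');
    INSERT INTO grade_nat VALUES ('Jane Doe', 'DB101', '2024S', 'A');
    INSERT INTO enrollment_sur VALUES (1, 'Jane Doe', 'DB101', '2024S');
    INSERT INTO grade_sur VALUES (1, 'A');
""")

natural_join = """
    SELECT e.student_name, g.grade
    FROM enrollment_nat AS e
    JOIN grade_nat AS g
      ON g.student_name = e.student_name
     AND g.course_code  = e.course_code
     AND g.term         = e.term
"""
surrogate_join = """
    SELECT e.student_name, g.grade
    FROM enrollment_sur AS e
    JOIN grade_sur AS g ON g.enrollment_id = e.id
"""
natural_rows = conn.execute(natural_join).fetchall()
surrogate_rows = conn.execute(surrogate_join).fetchall()
print(natural_rows == surrogate_rows)  # True
```

Note that the surrogate version keeps a `UNIQUE` constraint on the natural columns, so the business rule is still enforced while the joins stay on one column.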