不必要的正常化
我和我的朋友正在建立一个网站,并且存在重大分歧。该网站的核心是关于“人”的评论数据库。基本上人们可以输入评论,也可以输入评论的对象。然后,观众可以在数据库中搜索评论中的单词或人名的一部分。它完全是用户生成的。例如,如果有人想对某人名字的拼写错误版本发表评论,他们可以,那没问题。因此,可能会有不同人的多种拼写被列为几个不同的条目(有些是中间名,有些是昵称,有些是拼写错误的,等等),但这都没关系。我们不在乎人们是否对随机的人或虚构的人发表评论。
无论如何,问题在于我们如何构建数据库。现在它只是一张以评论 ID 作为主键的表,然后有一个用于评论所涉及的“人”的字段:
评论 ID - 评论 - 人物
1 - “他很奇怪” - John Smith
2 - “臭女孩” - 珍妮
3 - “同性恋” - 约翰·史密斯
4 - “欠我 20 美元” - 珍妮yyyyyyyy
一切正常。使用数据库,我可以创建列出特定“人”的所有“评论”的页面。然而,他对数据库未标准化感到困扰。我阅读了有关标准化的内容,发现他错了。该表目前已标准化,因为评论 ID 是唯一的,并且规定了“评论”和“人物”。现在他坚持认为“人”应该有自己的桌子,因为它是“物”。我认为没有必要,因为尽管“人”确实是更大的容器(一个“人”可以有很多关于他们的“评论”),但数据库似乎运行得很好,“人”是评论 ID。我使用不同的 PHP 调用来执行不同的 SQL 选择,使其在输出以及用户搜索和查看结果的不同方式上神奇地显得更加复杂,但实际上,设置非常简单。我现在让用户对评论进行排序,并在同一张表上保留一个“分数”作为另一个字段。
我认为目前不需要为唯一的“人员”条目建立一个单独的表,因为“人员”没有自己的“分数”或任何自己的属性。只有评论才可以。我的朋友如此坚持认为这是为了效率所必需的。最后我说,“好吧,如果你想让我创建一个单独的表,并让“人”成为它自己的字段,那么第二个字段是什么?因为如果一个表只有一个列,那似乎毫无意义。我同意我们稍后可能会需要为“人”提供自己的桌子,但我们可以在那时处理这个问题。”然后他说字符串不能成为主键,我们会将当前表中的“persons”转换为数字,数字将成为新“person”表中的主键。对我来说,这似乎是不必要的,它会使当前的表格更难阅读。他还认为以后不可能创建第二个表,而且我们现在需要预测以后可能需要它。
谁是对的?
My friend and I are building a website and having a major disagreement. The core of the site is a database of comments about 'people.' Basically people can enter comment and they can enter the person the comment is about. Then viewers can search the database for words that are in the comment or parts of the person name. It is completely user generated. For example, if someone wants to post a comment on a mispelled version of a person's name, they can, and that's OK. So there may be multiple spellings of different people listed as several different entries (some with middle name, some with nickname, some mispelled, etc.), but this is all OK. We don't care if people make comments about random people or imaginary people.
Anyway, the issue is about how we are structuring the database. Right now it is just one table with the comment ID as the primary key, and then there is a field for the 'person' the comment is about:
comment ID - comment - person
1 - "he is weird" - John Smith
2 - "smelly girl" - Jenny
3 - "gay" - John Smith
4 - "owes me $20" - Jennyyyyyyyyy
Everything is working fine. Using the database, I am able to create pages that list all the 'comments' for a particular 'person.' However, he is obsessed that the database isn't normalized. I read up on normalization and learned that he was wrong. The table IS currently normalized, because the comment ID is unique and dictates the 'comment' and the 'person.' Now he is insistant that 'person' should have it's OWN table because it is a 'thing.' I don't think it is necessary, because even though 'person' really is the bigger container (one 'person' can have many 'comments' about them), the database seems to operate just fine with 'person' being an attribute of the comment ID. I use various PHP calls for different SQL selections to make it magically appear more sophisticated on the output and the different way the user can search and see results, but in reality, the set-up is quite simple. I am now letting users rank comments with thumbs up and thumbs down, and I keep a 'score' as another field on the same table.
I feel that there is currently no need to have a separate table for just unique 'person' entries because the 'persons' don't have their own 'score' or any of their own attributes. Only the comments do. My friend is so insistant that it is necessary for efficiency. Finally I said, "OK, if you want me to create a separate table and let 'person' be it's own field, then what would be the second field? Because if a table has just a single column, it seems pointless. I agree that we may later create a need to give 'person' it's own table, but we can deal with that then." He then said that strings can't be primary keys, and that we would convert the 'persons' in the current table to numbers, and the numbers would be the primary key in the new 'person' table. To me this seems unnecessary and it would make the current table harder to read. He also thinks it will be impossible to create the second table later, and that we need to anticipate now that we might need it for something later.
Who is right?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
为 person 创建一个新表并使用该表的键代替 person 属性与规范化无关。出于其他原因,这可能是一个好主意,但这样做并不会让数据库比不这样做“更加规范化”。所以你是对的:就规范化而言,创建另一个表是不必要的。
Creating a new table for person and using the key of that table in place of the person attribute has nothing to do with normalization. It may be a good idea for other reasons but doing so does not make the database "more normalized" than not doing it. So you are right: as far as normalization is concerned, creating another table is unnecessary.
在我看来,你的朋友是对的。
人应该住在不同的桌子上,你应该尝试正常化。不过,不要做得太过分。
从长远来看,您可能希望对您的网站做更多的事情,比如说您想将多个文件附加到一个人(即图片),那么您将非常感谢标准化。
In my opinion your friend is right.
Person should live in a different table and you should try to normalize. Don't overdo-it, though.
In the long run you may want to do more things with your site, say you want to attach multiple files to a person (ie. pictures) you'll be very thankfull then for the normalization.
我会投票给你的朋友。我喜欢标准化并为未来做好计划,即使你永远不需要它,这种标准化也很容易做到,几乎不需要时间。您可以创建一个查询视图,以使您的 SQL 更简洁,并且无需您自己连接表。
I would vote for your friend. I like to normalize and plan for the future and even if you never need it, this normalization is so easy to do it literally takes no time. You can create a view that you query in order to make your SQL cleaner and eliminate the need for you to join the tables yourself.
如果你已经达到了所有的能力并且没有扩展能力的计划,我认为你就保持原样。
如果您计划添加更多内容,即允许人们拥有帐户或其他任何内容,我认为将数据分成“人员”、“评论”表可能是明智的。这并不难,并且可以更轻松地扩展您的功能。
If you have already reached all of your capabilities and have no plans for expansion of capabilities I think you leave it as it is.
If you plan to add more, namely allowing people to have accounts, or anything really, I think it might be smart to separate your data into Person, Comments tables. Its not hard and makes expanding your functionality easier.
你说得对。
Person
可能是一般事物,但不在您的模型中。如果您想让人们正确识别他们正在谈论的人,则需要一个Person
表。例如,如果评论仅涉及已在数据库中注册的人员。但这里看起来你有一个非结构化数据,没有身份;并且没有人有兴趣确定“jenny”和“jennyyy”实际上是否是同一个人,更不用说“jenny doe”和“我的表弟”......
You're right.
Person
may be a thing in general, but not in your model. If you were going to hassle people into properly identifying the person they're talking about, aPerson
table would be necessary. For example, if the comments were only about persons already registered in the database.But here it looks like you have an unstructured data, without identity; and that nothing/nobody is interested in making sure whether "jenny" and "jennyyy" are in fact the same person, not to mentionned "jenny doe", and "my cousin"...
嗯,有两种思想流派。有人说,以尽可能最规范化的方式创建数据模型,如果需要更高的效率,则进行反规范化。另一个基本上是“做工作所需的最少工作,然后随着需求的变化而改变”。也称为 YAGNI(您不会需要它)。
这完全取决于你对这件事的看法。如果这就是全部,那么您的方法可能没问题。如果您打算随着时间的推移通过新功能来改进它,那么您的朋友是对的。
Well, there are two schools of thought. One says, create your data model in the most normalized way possible, then de-normalize if you need more efficiency. The other is basically "do the minimum work necessary for the job, then change it as your requirements change". Also known as YAGNI (You aren't going to need it).
It all depends on where you see this going. If this is all it will be, then your approach is probably fine. If you intend to improve it with new features over time, then your friend is right.
如果您从未打算将人员列与用户或其他任何内容相关联,并且数据显然不需要一致性或数据完整性检查,那么为什么这会出现在关系数据库中呢?这难道不是 nosql 数据库的一个用例吗?或者我错过了什么?
If you never intend to associate the person column with a user or anything else and data apparently needs no consistency or data integrity checks, just why is this in a relational database at all? Wouldn't this be a use case for a nosql database? Or am I missing something?
规范化是关于函数依赖(FD)的。您需要识别所有
在数据模型完全标准化之前,FD 存在于数据模型的属性中。
让我们回顾一下您所拥有的内容:
CommentId
的任何给定实例都可以在功能上确定Person
(FD:CommentId
->Person)
CommentId
的任何给定实例在功能上确定Comment
(FD:CommentId
->Comment
)UserId
(FD:CommentId
->UserId
)CommentId
的任何给定实例CommentId
的功能决定了Score
(FD:CommentId
->Score
)这里的所有内容都是依赖属性关于
CommentId
和单独的CommentId。这可能会让您相信包含全部或子集的关系(表)
上述属性必须标准化。
首先要问自己的是,为什么要创建
CommentId
属性?严格来说,这是一个制造的属性——它与任何“真实”的东西无关。评论ID是
通常称为代理键。代理键只是一个代表的虚构值
对应于其他一些属性组的唯一值集。那么CommentId是哪组属性呢?
的代理人?我们可以算出
通过提出以下问题并向模型添加新的 FD:
Comment
->CommentId
必须为 true。FD:
人物
+评论
->CommentId
必须为 true,并且上面 1 中的 FD 为 false。不同的UserId?如果是这样,则 1 和 2 中的 FD 不可能为真,但
FD:
人员
+评论
+用户ID
->CommentId
可能是 true。有不同的分数吗?这意味着 FD:
Person
+Comment
+UserId
' +Score
->CommentId
为 true,其他为 false。上述 4 个 FD 中必须有一个为真。无论哪种都会影响数据模型的标准化方式。
假设FD:
Person
+Comment
+UserId
->CommentId
结果是正确的。逻辑性后果是:
Person
+Comment
+UserId
和CommentId
作为相对于Score 的等效键
Score
应与其一个键而不是两个键建立关系(以避免传递依赖)。显而易见的选择是
CommentId
,因为它是专门作为代理创建的。CommentId
、Person
、Comment
、UserId
来绑定其替代品的关键。
从理论上来说,代理键
CommentId
不是使您的数据模型或数据库正常工作所需的。然而,它的存在可能会影响关系的构建方式。
代理键的创建是一个具有一定重要性的实际问题。
考虑一下如果您选择不使用代理键而是使用完整的代理键,可能会发生什么情况
属性集
Person
+Comment
+UserId
代替,特别是在需要时在多个表上作为外键或主键:
到您的数据库,因为它在多个表中重复。它可能超过几个字符长。
对于 Comment 是键的一部分的所有表。不是什么美丽的景象!
分配给代理键的值永远不会改变,无论您对值做什么
与其确定的属性相关联。现在更新依赖属性
仅限于定义代理键的一张表。这具有巨大的现实意义。
现在回到是否应该为
Person
创建代理。Person
还活着吗在许多或任何 FD 的左侧?如果是这样,它的价值将通过您传播
数据库,并且有一个为其创建代理的情况。 Person 是文本属性还是数字属性与创建代理键的选择无关。
根据你所说的,最多只有一个微弱的论据来创建一个
Person
的代理。这个论点是基于这样的怀疑:它的值可能在未来的某个时刻成为密钥或密钥的一部分。Normalization is all about functional dependencies (FD's). You need to identify all of the
FD's that exist among the attributes of your data model before it can be fully normalized.
Lets review what you have:
CommentId
functionally determines thePerson
(FD:CommentId
->Person
)CommentId
functionally determines theComment
(FD:CommentId
->Comment
)CommentId
functionally determines theUserId
(FD:CommentId
->UserId
)CommentId
functionally determines theScore
(FD:CommentId
->Score
)Everything here is a dependant attribute on
CommentId
andCommentId
alone. This might lead you to the belief that a relation (table) containing all of, or a subset of, theabove attributes must be normalized.
First thing to ask yourself is why did you create the
CommentId
attribute anyway? Strictly speaking,this is a manufactured attribute - it does not relate to anything 'real'. CommentId is
commonly referred to as a surrogate key. A surrogate key is just a made up value that stands in
for a unique value set corresponding to some other group of attributes. So what group of attributes is
CommentId
a surrogate for? We can figure that
out by asking the following questions and adding new FD's to the model:
Comment
->CommentId
must be true.FD:
Person
+Comment
->CommentId
must be true and the FD in 1 above is false.different UserId's? If so, the FDs in 1 and 2 cannot be true but
FD:
Person
+Comment
+UserId
->CommentId
may be true.have different Scores? This implies FD:
Person
+Comment
+UserId
' +Score
->CommentId
is true and the others are false.Exactly one of the above 4 FD's above must be true. Whichever it is affects how your data model is normalized.
Suppose FD:
Person
+Comment
+UserId
->CommentId
turns out to be true. The logicalconsequences are that:
Person
+Comment
+UserId
andCommentId
serve as equivalent keys with respect toScore
Score
should be put in a relation with one but not both of its keys (to avoid transitive dependencies).The obvious choice would be
CommentId
since it was specifically created as a surrogate.CommentId
,Person
,Comment
,UserId
is needed to tie theKey to its surrogate.
From a theoretical point of view, the surrogate key
CommentId
is notrequired to make your data model or database work. However, its presence may affect how relations are constructed.
Creation of surrogate keys is a practical issue of some importance.
Consider what might happen if you choose to not use a surrogate key but the full
attribute set
Person
+Comment
+UserId
in its place, especially if it was requiredon multiple tables as a foreign or primary key:
to your database because it is repeated in multiple tables. It is probably more than a couple of characters long.
to all tables where Comment is part of a key. Not a pretty sight!
The value assigned to a surrogate key never changes, no matter what you do to the values
associated to the attributes that it determines. Updating the dependant attributes is now
limited to the one table defining the surrogate key. This is of huge practical significance.
Now back to whether you should be creating a surrogate for
Person
. DoesPerson
liveon the left hand side of many, or any, FDs? If it does, its value will propogate through your
database and there is a case for creating a surrogate for it. Whether Person is a text or numeric attribute is irrelevant to the choice of creating a surrogate key.
Based on what you have said, there is at best a weak argument to create a
surrogate for
Person
. This argument is based on the suspicion that its value may at some point become a key or part of a key at some point in the future.事情是这样的。每当你创造一些东西时,你都想确保它有成长的空间。您想要尝试预测您的计划的未来项目和未来进展。在这种情况下,您说得对,当前不需要添加仅包含 1 个字段的人员表(不计算 ID,假设您有一个 int ID 字段和一个人名)。然而,将来,您可能希望为这些人提供其他属性,例如名字、姓氏、电子邮件地址、添加日期等。
虽然过度规范化肯定是有害的,但我个人会创建另一个更大的表来保存具有附加字段的人,以便我将来可以轻松添加新功能。
Here's the deal. Whenever you create something, you want to make sure that it has room to grow. You want to try to anticipate future projects and future advancements for your program. In this scenario, you're right in saying that there is no need currently to add a persons table that just holds 1 field (not counting the ID, assuming you have an int ID field and a person name). However, in the future, you may want to have other attributes for such people, like first name, last name, email address, date added, etc.
While over-normalizing is certainly harmful, I personally would create another, larger table to hold the person with additional fields so that I can easily add new features in the future.
每当你与用户打交道时,都应该有一张专用的桌子。然后您可以连接表并引用该用户的 ID。
SQL 将评论加入用户:
将来当您想要查找有关特定用户的信息时,这将使事情变得更加容易。额外的努力可以忽略不计。
关于每个评论的“分数”,也应该是一个单独的表。这样您就可以将用户与“喜欢”或“不喜欢”联系起来。
Whenever you're dealing with users, there should be a dedicated table. Then you can just join the tables and refer to that user's ID.
SQL to join the comments to the users:
It'll make it so much easier in the future when you want to find information about that specific user. The amount of extra effort is negligible.
Concerning the "score" for each comment, that should also be a separate table as well. That way you can connect a user to a "like" or "dislike."
有了这个数据库,你可能觉得还可以,但是以后你想让用户从数据库中了解更多的时候可能会出现一些问题。假设你想知道一个名为=的人的评论数量'abc'。在这种情况下,您将必须遍历整个评论表并继续计数。取而代之的是,您可以为每个人设置一个名为“count”的属性,并在对该人发表评论时递增该属性人。
就规范化而言,拥有规范化的数据库总是更好,因为它减少了冗余并使数据库直观易懂。如果您预计您的数据库将来会变得很大,那么必须进行规范化。
With this database, you might feel that it is okay but there may be some problem in the future when you want the users to know more from the database.Suppose you want to know about the number of comments made on a person with the name='abc'.In this case ,you will have to go through the entire table of comments and keep counting.In place of this, you can have an attribute called 'count' for every person and increment it whenever a comment is made on that person.
As far as normalization is concerned,it is always better to have a normalized database because it reduces redundancy and makes the database intuitive to understand. If you are expecting that your database will go large in future then normalization must be present.