不必要的正常化

发布于 2024-09-18 05:51:21 字数 1091 浏览 9 评论 0原文

我和我的朋友正在建立一个网站，并且存在重大分歧。该网站的核心是关于“人”的评论数据库。基本上人们可以输入评论，也可以输入评论的对象。然后，观众可以在数据库中搜索评论中的单词或人名的一部分。它完全是用户生成的。例如，如果有人想对某人名字的拼写错误版本发表评论，他们可以，那没问题。因此，可能会有不同人的多种拼写被列为几个不同的条目（有些是中间名，有些是昵称，有些是拼写错误的，等等），但这都没关系。我们不在乎人们是否对随机的人或虚构的人发表评论。

无论如何，问题在于我们如何构建数据库。现在它只是一张以评论 ID 作为主键的表，然后有一个用于评论所涉及的“人”的字段：

评论 ID - 评论 - 人物

1 - “他很奇怪” - John Smith

2 - “臭女孩” - 珍妮

3 - “同性恋” - 约翰·史密斯

4 - “欠我 20 美元” - 珍妮yyyyyyyy

一切正常。使用数据库，我可以创建列出特定“人”的所有“评论”的页面。然而，他对数据库未标准化感到困扰。我阅读了有关标准化的内容，发现他错了。该表目前已标准化，因为评论 ID 是唯一的，并且规定了“评论”和“人物”。现在他坚持认为“人”应该有自己的桌子，因为它是“物”。我认为没有必要，因为尽管“人”确实是更大的容器（一个“人”可以有很多关于他们的“评论”），但数据库似乎运行得很好，“人”是评论 ID。我使用不同的 PHP 调用来执行不同的 SQL 选择，使其在输出以及用户搜索和查看结果的不同方式上神奇地显得更加复杂，但实际上，设置非常简单。我现在让用户对评论进行排序，并在同一张表上保留一个“分数”作为另一个字段。

我认为目前不需要为唯一的“人员”条目建立一个单独的表，因为“人员”没有自己的“分数”或任何自己的属性。只有评论才可以。我的朋友如此坚持认为这是为了效率所必需的。最后我说，“好吧，如果你想让我创建一个单独的表，并让“人”成为它自己的字段，那么第二个字段是什么？因为如果一个表只有一个列，那似乎毫无意义。我同意我们稍后可能会需要为“人”提供自己的桌子，但我们可以在那时处理这个问题。”然后他说字符串不能成为主键，我们会将当前表中的“persons”转换为数字，数字将成为新“person”表中的主键。对我来说，这似乎是不必要的，它会使当前的表格更难阅读。他还认为以后不可能创建第二个表，而且我们现在需要预测以后可能需要它。

谁是对的？

原文

My friend and I are building a website and having a major disagreement. The core of the site is a database of comments about 'people.' Basically people can enter comment and they can enter the person the comment is about. Then viewers can search the database for words that are in the comment or parts of the person name. It is completely user generated. For example, if someone wants to post a comment on a mispelled version of a person's name, they can, and that's OK. So there may be multiple spellings of different people listed as several different entries (some with middle name, some with nickname, some mispelled, etc.), but this is all OK. We don't care if people make comments about random people or imaginary people.

Anyway, the issue is about how we are structuring the database. Right now it is just one table with the comment ID as the primary key, and then there is a field for the 'person' the comment is about:

comment ID - comment - person

1 - "he is weird" - John Smith

2 - "smelly girl" - Jenny

3 - "gay" - John Smith

4 - "owes me $20" - Jennyyyyyyyyy

Everything is working fine. Using the database, I am able to create pages that list all the 'comments' for a particular 'person.' However, he is obsessed that the database isn't normalized. I read up on normalization and learned that he was wrong. The table IS currently normalized, because the comment ID is unique and dictates the 'comment' and the 'person.' Now he is insistant that 'person' should have it's OWN table because it is a 'thing.' I don't think it is necessary, because even though 'person' really is the bigger container (one 'person' can have many 'comments' about them), the database seems to operate just fine with 'person' being an attribute of the comment ID. I use various PHP calls for different SQL selections to make it magically appear more sophisticated on the output and the different way the user can search and see results, but in reality, the set-up is quite simple. I am now letting users rank comments with thumbs up and thumbs down, and I keep a 'score' as another field on the same table.

I feel that there is currently no need to have a separate table for just unique 'person' entries because the 'persons' don't have their own 'score' or any of their own attributes. Only the comments do. My friend is so insistant that it is necessary for efficiency. Finally I said, "OK, if you want me to create a separate table and let 'person' be it's own field, then what would be the second field? Because if a table has just a single column, it seems pointless. I agree that we may later create a need to give 'person' it's own table, but we can deal with that then." He then said that strings can't be primary keys, and that we would convert the 'persons' in the current table to numbers, and the numbers would be the primary key in the new 'person' table. To me this seems unnecessary and it would make the current table harder to read. He also thinks it will be impossible to create the second table later, and that we need to anticipate now that we might need it for something later.

Who is right?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

苏别ゝ 2024-09-25 05:51:21

为 person 创建一个新表并使用该表的键代替 person 属性与规范化无关。出于其他原因，这可能是一个好主意，但这样做并不会让数据库比不这样做“更加规范化”。所以你是对的：就规范化而言，创建另一个表是不必要的。

回复收藏 0 原文

dawn曙光 2024-09-25 05:51:21

在我看来，你的朋友是对的。

人应该住在不同的桌子上，你应该尝试正常化。不过，不要做得太过分。

从长远来看，您可能希望对您的网站做更多的事情，比如说您想将多个文件附加到一个人（即图片），那么您将非常感谢标准化。

回复收藏 0 原文

忆依然 2024-09-25 05:51:21

我会投票给你的朋友。我喜欢标准化并为未来做好计划，即使你永远不需要它，这种标准化也很容易做到，几乎不需要时间。您可以创建一个查询视图，以使您的 SQL 更简洁，并且无需您自己连接表。

回复收藏 0 原文

哆兒滾 2024-09-25 05:51:21

如果你已经达到了所有的能力并且没有扩展能力的计划，我认为你就保持原样。

如果您计划添加更多内容，即允许人们拥有帐户或其他任何内容，我认为将数据分成“人员”、“评论”表可能是明智的。这并不难，并且可以更轻松地扩展您的功能。

回复收藏 0 原文

嘿看小鸭子会跑 2024-09-25 05:51:21

你说得对。

Person 可能是一般事物，但不在您的模型中。如果您想让人们正确识别他们正在谈论的人，则需要一个 Person 表。例如，如果评论仅涉及已在数据库中注册的人员。

但这里看起来你有一个非结构化数据，没有身份；并且没有人有兴趣确定“jenny”和“jennyyy”实际上是否是同一个人，更不用说“jenny doe”和“我的表弟”......

回复收藏 0 原文

℡Ms空城旧梦 2024-09-25 05:51:21

嗯，有两种思想流派。有人说，以尽可能最规范化的方式创建数据模型，如果需要更高的效率，则进行反规范化。另一个基本上是“做工作所需的最少工作，然后随着需求的变化而改变”。也称为 YAGNI（您不会需要它）。

这完全取决于你对这件事的看法。如果这就是全部，那么您的方法可能没问题。如果您打算随着时间的推移通过新功能来改进它，那么您的朋友是对的。

回复收藏 0 原文

So尛奶瓶 2024-09-25 05:51:21

如果您从未打算将人员列与用户或其他任何内容相关联，并且数据显然不需要一致性或数据完整性检查，那么为什么这会出现在关系数据库中呢？这难道不是 nosql 数据库的一个用例吗？或者我错过了什么？

回复收藏 0 原文

寻找我们的幸福 2024-09-25 05:51:21

规范化是关于函数依赖（FD）的。您需要识别所有
在数据模型完全标准化之前，FD 存在于数据模型的属性中。

让我们回顾一下您所拥有的内容：

CommentId 的任何给定实例都可以在功能上确定 Person（FD：CommentId -> Person)
CommentId 的任何给定实例在功能上确定 Comment（FD：CommentId -> Comment）
在功能上确定 UserId（FD：CommentId -> UserId）
CommentId 的任何给定实例 CommentId 的功能决定了 Score（FD：CommentId -> Score）

这里的所有内容都是依赖属性关于 CommentId 和
单独的CommentId。这可能会让您相信包含全部或子集的关系（表）
上述属性必须标准化。

首先要问自己的是，为什么要创建 CommentId 属性？严格来说，
这是一个制造的属性——它与任何“真实”的东西无关。评论ID是
通常称为代理键。代理键只是一个代表的虚构值
对应于其他一些属性组的唯一值集。那么CommentId是哪组属性呢？
的代理人？我们可以算出
通过提出以下问题并向模型添加新的 FD：

1) 注释必须是唯一的吗？如果是这样，FD：Comment -> CommentId 必须为 true。
2）只要是针对不同的人，同一个评论可以多次发表吗？如果是这样，那么
FD：人物 + 评论 -> CommentId 必须为 true，并且上面 1 中的 FD 为 false。
3) 是否可以对同一个人多次发表相同的评论，前提是该评论是由
不同的UserId？如果是这样，则 1 和 2 中的 FD 不可能为真，但
FD：人员 + 评论 + 用户ID -> CommentId 可能是 true。
4) 是否可以通过相同的 UserId 对同一个人进行多次相同的评论，但是
有不同的分数吗？这意味着 FD：Person + Comment + UserId' + Score -> CommentId 为 true，其他为 false。

上述 4 个 FD 中必须有一个为真。无论哪种都会影响数据模型的标准化方式。

假设FD：Person + Comment + UserId -> CommentId 结果是正确的。逻辑性
后果是：

Person + Comment + UserId 和 CommentId 作为相对于 Score 的等效键
Score 应与其一个键而不是两个键建立关系（以避免传递依赖）。
显而易见的选择是 CommentId，因为它是专门作为代理创建的。
需要一个由以下组成的关系：CommentId、Person、Comment、UserId 来绑定
其替代品的关键。

从理论上来说，代理键CommentId不是
使您的数据模型或数据库正常工作所需的。然而，它的存在可能会影响关系的构建方式。

代理键的创建是一个具有一定重要性的实际问题。
考虑一下如果您选择不使用代理键而是使用完整的代理键，可能会发生什么情况
属性集 Person + Comment + UserId 代替，特别是在需要时
在多个表上作为外键或主键：

注释可能会增加大量空间开销
到您的数据库，因为它在多个表中重复。它可能超过几个字符长。
如果有人选择编辑评论会发生什么？这种变化需要传播
对于 Comment 是键的一部分的所有表。不是什么美丽的景象！
对长的复杂键建立索引可能会占用大量空间和/或导致更新性能变慢

分配给代理键的值永远不会改变，无论您对值做什么
与其确定的属性相关联。现在更新依赖属性
仅限于定义代理键的一张表。这具有巨大的现实意义。

现在回到是否应该为 Person 创建代理。 Person 还活着吗
在许多或任何 FD 的左侧？如果是这样，它的价值将通过您传播
数据库，并且有一个为其创建代理的情况。 Person 是文本属性还是数字属性与创建代理键的选择无关。

根据你所说的，最多只有一个微弱的论据来创建一个
Person 的代理。这个论点是基于这样的怀疑：它的值可能在未来的某个时刻成为密钥或密钥的一部分。

Normalization is all about functional dependencies (FD's). You need to identify all of the
FD's that exist among the attributes of your data model before it can be fully normalized.

Lets review what you have:

Any given instance of a CommentId functionally determines the Person (FD: CommentId -> Person)
Any given instance of a CommentId functionally determines the Comment (FD: CommentId -> Comment)
Any given instance of a CommentId functionally determines the UserId (FD: CommentId -> UserId)
Any given instance of a CommentId functionally determines the Score (FD: CommentId -> Score)

Everything here is a dependant attribute on CommentId and
CommentId alone. This might lead you to the belief that a relation (table) containing all of, or a subset of, the
above attributes must be normalized.

First thing to ask yourself is why did you create the CommentId attribute anyway? Strictly speaking,
this is a manufactured attribute - it does not relate to anything 'real'. CommentId is
commonly referred to as a surrogate key. A surrogate key is just a made up value that stands in
for a unique value set corresponding to some other group of attributes. So what group of attributes is CommentId
a surrogate for? We can figure that
out by asking the following questions and adding new FD's to the model:

1) Does a Comment have to be unique? If so the FD: Comment -> CommentId must be true.
2) Can the same Comment be made multiple times as long as it is about a different Person? If so, then
FD: Person + Comment -> CommentId must be true and the FD in 1 above is false.
3) Can the same Comment be made multiple times about the same Person provided it was made by
different UserId's? If so, the FDs in 1 and 2 cannot be true but
FD: Person + Comment + UserId -> CommentId may be true.
4) Can the same Comment be made multiple times about the same Person by the same UserId but
have different Scores? This implies FD: Person + Comment + UserId' + Score -> CommentId is true and the others are false.

Exactly one of the above 4 FD's above must be true. Whichever it is affects how your data model is normalized.

Suppose FD: Person + Comment + UserId -> CommentId turns out to be true. The logical
consequences are that:

Person + Comment + UserId and CommentId serve as equivalent keys with respect to Score
Score should be put in a relation with one but not both of its keys (to avoid transitive dependencies).
The obvious choice would be CommentId since it was specifically created as a surrogate.
A relation comprised of: CommentId, Person, Comment, UserId is needed to tie the
Key to its surrogate.

From a theoretical point of view, the surrogate key CommentId is not
required to make your data model or database work. However, its presence may affect how relations are constructed.

Creation of surrogate keys is a practical issue of some importance.
Consider what might happen if you choose to not use a surrogate key but the full
attribute set Person + Comment + UserId in its place, especially if it was required
on multiple tables as a foreign or primary key:

Comment might add a lot of space overhead
to your database because it is repeated in multiple tables. It is probably more than a couple of characters long.
What happens if someone chooses to edit a Comment? That change needs to be propagated
to all tables where Comment is part of a key. Not a pretty sight!
Indexing long complex keys can take a lot of space and/or make for slow update performance

The value assigned to a surrogate key never changes, no matter what you do to the values
associated to the attributes that it determines. Updating the dependant attributes is now
limited to the one table defining the surrogate key. This is of huge practical significance.

Now back to whether you should be creating a surrogate for Person. Does Person live
on the left hand side of many, or any, FDs? If it does, its value will propogate through your
database and there is a case for creating a surrogate for it. Whether Person is a text or numeric attribute is irrelevant to the choice of creating a surrogate key.

Based on what you have said, there is at best a weak argument to create a
surrogate for Person. This argument is based on the suspicion that its value may at some point become a key or part of a key at some point in the future.

回复收藏 0 原文

请叫√我孤独 2024-09-25 05:51:21

事情是这样的。每当你创造一些东西时，你都想确保它有成长的空间。您想要尝试预测您的计划的未来项目和未来进展。在这种情况下，您说得对，当前不需要添加仅包含 1 个字段的人员表（不计算 ID，假设您有一个 int ID 字段和一个人名）。然而，将来，您可能希望为这些人提供其他属性，例如名字、姓氏、电子邮件地址、添加日期等。

虽然过度规范化肯定是有害的，但我个人会创建另一个更大的表来保存具有附加字段的人，以便我将来可以轻松添加新功能。

回复收藏 0 原文

﹎☆浅夏丿初晴 2024-09-25 05:51:21

每当你与用户打交道时，都应该有一张专用的桌子。然后您可以连接表并引用该用户的 ID。

user -> id | username | password | email

comment -> id | user_id | content

SQL 将评论加入用户：

SELECT user.username, comment.content FROM user JOIN comment WHERE user.id = comment.user_id;

将来当您想要查找有关特定用户的信息时，这将使事情变得更加容易。额外的努力可以忽略不计。

关于每个评论的“分数”，也应该是一个单独的表。这样您就可以将用户与“喜欢”或“不喜欢”联系起来。

Whenever you're dealing with users, there should be a dedicated table. Then you can just join the tables and refer to that user's ID.

user -> id | username | password | email

comment -> id | user_id | content

SQL to join the comments to the users:

SELECT user.username, comment.content FROM user JOIN comment WHERE user.id = comment.user_id;

It'll make it so much easier in the future when you want to find information about that specific user. The amount of extra effort is negligible.

Concerning the "score" for each comment, that should also be a separate table as well. That way you can connect a user to a "like" or "dislike."

回复收藏 0 原文

冷夜 2024-09-25 05:51:21

有了这个数据库，你可能觉得还可以，但是以后你想让用户从数据库中了解更多的时候可能会出现一些问题。假设你想知道一个名为=的人的评论数量'abc'。在这种情况下，您将必须遍历整个评论表并继续计数。取而代之的是，您可以为每个人设置一个名为“count”的属性，并在对该人发表评论时递增该属性人。
就规范化而言，拥有规范化的数据库总是更好，因为它减少了冗余并使数据库直观易懂。如果您预计您的数据库将来会变得很大，那么必须进行规范化。

回复收藏 0 原文

~没有更多了~