正常化要走多远?

发布于 2024-07-13 11:47:09 字数 1431 浏览 9 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(13

半岛未凉 2024-07-20 11:47:09

标准化直到有问题,然后非标准化直到有效

Normalize till it hurts, then denormalize till it works

这样的小城市 2024-07-20 11:47:09

非规范化的优点是在大型查询上可以快速选择。

缺点是:

  • 需要更多的编码和时间来确保完整性(这对您的情况来说最重要)

  • 它是DML 上较慢(INSERT/UPDATE/DELETE)

  • 它需要更多空间

至于优化,您可以优化其中一个以获得更快的速度查询或更快的 DML(通常,这两者是对立的)。

优化更快的查询通常意味着复制数据,无论是非规范化、索引、额外的表等等。

如果是索引,RDBMS 会为您完成此操作,但如果是非规范化,您将需要自己编码。 如果部门搬到另一个办公室怎么办? 您需要在三张表而不是一张表中修复它。

因此,正如我从表的名称中看到的那样,那里不会有数百万条记录。 所以你最好规范化你的数据,这样会更容易管理。

Denormalization has the advantage of fast SELECTs on large queries.

Disadvantages are:

  • It takes more coding and time to ensure integrity (which is most important in your case)

  • It's slower on DML (INSERT/UPDATE/DELETE)

  • It takes more space

As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).

Optimizing for faster querying often implies duplicating data, be it denormalization, indices, extra tables of whatever.

In case of indices, the RDBMS does it for you, but in case of denormalization, you'll need to code it yourself. What if Department moves to another Office? You'll need to fix it in three tables instead of one.

So, as I can see from the names of your tables, there won't be millions records there. So you'd better normalize your data, it will be simplier to manage.

眼前雾蒙蒙 2024-07-20 11:47:09

始终根据需要进行规范化,以消除数据库完整性问题(即潜在的重复或丢失数据)。

即使非规范化可以带来性能提升(通常情况并非如此),但失去数据完整性的代价也太高,无法证明其合理性。

只要问问那些必须致力于解决遗留数据库中所有模糊问题的人,他们是否更喜欢良好的数据或微不足道的(如果有的话)速度提升。

另外,正如约翰所提到的 - 如果您最终需要非规范化数据(用于速度/报告等),则在单独的表中创建它,保留原始数据。

Always normalize as far as necessary to remove database integrity issues (i.e. potential duplicated or missing data).

Even if there were performance gains from denormalizing (which is usually not the case), the price of losing data integrity is too high to justify.

Just ask anyone who has had to work on fixing all the obscure issues from a legacy database whether they would prefer good data or insignificant (if any) speed increases.

Also, as mentioned by John - if you do end up needing denormalised data (for speed/reporting/etc) then create it in a separate table, preserving the raw data.

迷路的信 2024-07-20 11:47:09

连接成本本身不应该让您太担心(除非您试图扩展到数百万用户,在这种情况下您绝对应该担心)。

我更关心对调用它的代码的影响。 规范化的数据库更容易编程,并且几乎总能提高应用程序本身的效率。

也就是说,不要超出理性范围进行正常化。 我见过为了标准化而标准化,它通常最终会出现在一个数据库中,该数据库有一个或两个实际数据表,以及 20 个只填充外键的表。 这显然是多余的。 我通常使用的规则是:如果列中的数据会重复,则应该对其进行标准化。

The cost of joins shouldn't worry you too much per se (unless you're trying to scale to millions of users, in which case you absolutely should worry).

I'd be more concerned about the effect on the code that's calling this. Normalized databases are much easier to program against, and almost always lead to better efficiency within the application itself.

That said, don't normalize beyond the bounds of reason. I've seen normalization for normalization's sake, which usually ends up in a database that has one or two tables of actual data, and 20 tables filled with nothing but foreign keys. That's clearly overkill. The rule I normally use is: If the data in a column would otherwise be duplicated, it should be normalized.

伤感在游骋 2024-07-20 11:47:09

最好将该模式保留为第三范式,并让 DBA 抱怨连接成本。

It is better to keep that schema in Third Normal Form and let your DBA to complain about joins cost.

德意的啸 2024-07-20 11:47:09

如果您的数据库一开始就没有正确规范化,DBA 应该担心。 在仔细测量性能并确定存在瓶颈后,您可能会开始反规范化,但我会非常谨慎。

DBA's should be concerned if your db is not properly normalized to begin with. After you carefully measured performance and determined you have bottlenecks you may start denormalizing, but I would be extremely cautious.

最单纯的乌龟 2024-07-20 11:47:09

我最担心的是 DBA 警告您连接的成本,除非您处于高度病态的情况。

I'd be most concerned about DBAs who are warning you about the cost of joins, unless you're in a highly pathological situation.

风尘浪孓 2024-07-20 11:47:09

在尝试过其他所有方法之前,您不应该考虑非规范化。

这真的是性能问题吗?
您的数据库是否有任何功能可以用来加快速度而不损害完整性?
可以通过缓存提高性能吗?

You shouldn't look at denormalizing before you've tried everything else.

Is the performance of this really an issue?
Do your database have any features you can use to speed things up without compromising integrity?
Can you increase your performance by caching?

骄傲 2024-07-20 11:47:09

规范化以对设计中的概念及其关系进行建模。 想想哪些关系可以改变,以及这样的改变对你的设计意味着什么。

在您发布的架构中,在我看来有一个明显的错误(如果您的组织工作方式有特殊情况,这可能不是错误)——有一个隐含的假设,即每个部门都完全处于一个办公室,同一部门的所有员工都在该办公室工作。

如果该部门占用两个办公室怎么办?

如果一名员工名义上属于一个部门,但在不同的办公室工作(假设您指的是实体办公室)怎么办?

Normalize to model the concepts in your design, and their relationship. Think of what relationships can change, and what a change like that will mean in terms of your design.

In the schema you posted, there is what looks to me like a glaring error (which may not be an error if you have a special case in terms of how your organization works) -- there is an implicit assumption that every department is in exactly one office, and that all the employees who are in the same department work at that office.

What if the department occupies two offices?

What if an employee nominally belongs to one department, but works out of a different office (assuming you are referring to physical offices)?

猫性小仙女 2024-07-20 11:47:09

不要非规范化。

根据简单而合理的设计原则设计您的表,这将使您轻松实现系统的其余部分。 易于构建、填充、使用和管理数据库。 可以轻松快速地运行查询和更新。 当情况需要时,可以轻松修改和扩展表格设计,并且由于轻微和短暂的原因而没有必要这样做。

一套设计原则是标准化。 规范化使表可以轻松快速地更新(包括插入和删除)。 规范化消除了更新异常,并消除了数据库自相矛盾的可能性。 这通过使错误变得不可能来防止大量错误。 它还通过使更新瓶颈变得不必要来防止大量更新瓶颈。 这很好。

还有其他几套设计原则。 它们导致表格设计未完全标准化。 但这不是“非规范化”。 这只是一种不同的设计,与规范化有些不兼容。

导致与规范化完全不同的设计的一组设计原则是星型模式设计。 星型模式的查询速度非常快。 只要有良好的 DBMS、良好的物理设计和足够的硬件来完成工作,即使是大规模的连接和聚合也可以在合理的时间内完成。 正如您所料,星型模式会出现更新异常。 当您保持数据库最新时,您必须针对这些异常情况进行编程。 您通常需要一个严格控制且精心构建的 ETL 流程,用于从其他(可能是规范化的)数据源更新星型模式。

使用存储在星型模式中的数据非常容易。 使用某种 OLAP 和报告引擎非常简单,您无需编写任何代码即可获取所需的所有信息,并且不会过多牺牲性能。

需要良好且深入的数据分析来设计良好的标准化模式。 数据分析中的错误和遗漏可能会导致未发现的功能依赖性。 这些未被发现的 FD 将导致无意中偏离正常化。

它还需要良好且深入的数据分析来设计和构建良好的星型模式。 数据分析中的错误和遗漏可能会导致维度和粒度的不幸选择。 这将使 ETL 几乎不可能建立,并且/或使得恒星的信息承载能力不足以满足新兴的需求。

良好且深入的数据分析不应成为分析瘫痪的借口。 分析必须在短时间内正确且合理地完成。 对于较小的项目,时间较短。 设计和实现应该能够经受住对数据分析和需求的一些后期添加和更正,但不能经受住不断的需求修订。

这个回答扩展了您原来的问题,但我认为它与数据库设计者相关。

Don't denormalize.

Design your tables according to simple and sound design principles that will make it easy to implement the rest of your system. Easy to build, populate, use, and administer the database. Easy and fast to run queries and updates against. Easy to revise and extend the table design when the situation calls for it, and unnecessary to do so for light and transient reasons.

One set of design principles is normalization. Normalization leads to tables that are easy and fast to update (including inserts and deletes). Normalization obviates update anomalies, and obviates the possiblity of a database that contradicts itself. This prevents a whole lot of bugs by making them impossible. It also prevents a whole lot of update bottlenecks by making them unnecessary. This is good.

There are other sets of design principles. They lead to table designs that are less than fully normalized. But that isn't "denormalization". It's just a different design, somewhat incompatible with normalization.

One set of design principles that leads to a radically different design from normalization is star schema design. Star schema is very fast for queries. Even large scale joins and aggregations can be done in a reasonable time, given a good DBMS, good physical design, and enough hardware to get the job done. As you might expect, a star schema suffers update anomalies. You have to program around these anomalies when you keep the database up to date. You will will generally need a tightly controlled and carefully built ETL process that updates the star schema from other (perhaps normalized) data sources.

Using data stored in a star schema is dramatically easy. It's so easy that using some kind of OLAP and reporting engine, you can get all the information needed without writing any code, and without sacrificing performance too much.

It takes good and somewhat deep data analysis to design a good normalized schema. Errors and omissions in data analysis may result in undiscovered functional dependencies. These undiscovered FDs will result in unwitting departures from normalization.

It also takes good and somewhat deep data analysis to design and build a good star schema. Errors and ommissions in data analysis may result in unfortunate choices in dimensions and granularity. This will make ETL almost impossible to build, and/or make the information carrying capacity of the star inadequate for the emerging needs.

Good and somewhat deep data analysis should not be an excuse for analysis paralysis. The analysis has to be right and reasonably complete in a short amount of time. Shorter for smaller projects. The design and implementation should be able to survive some late additions and corrections to the data analysis and to the requirements, but not a steady torrent of requirements revisions.

This response expands on your original question, but I think it's relevant for the would be database designer.

独享拥抱 2024-07-20 11:47:09

标准化是一个质量决策。

非规范化是一项性能决策。

这就是为什么 -

正常化直到疼痛为止; 反规范化直到它起作用。


质量决策告诉您哪种是您可以接受的最不规范的形式:

  1. 非冗余对您的表有多重要?
  2. 您想要多快的数据管理速度?
  3. 您希望表之间的关系有多清晰?

性能决策告诉我们可接受的最高范式:

  1. 我的数据库的响应足够快吗?
  2. 太多连接是否会导致速度变慢?

当您确定了您的情况下可接受的最低范式和最高范式后,请选择两者之间的任意范式。

Normalization is a quality decision.

Denormalization is a performance decision.

That's why -

Normalize till it hurts; De-normalize till it works.


Quality decisions tell which is the least Normal Form that you can live with:

  1. How much non-redundancy is important for your tables?
  2. How fast data management do you want?
  3. How clear do you want the relation between your tables?

Performance decisions tell what is the highest Normal Form acceptable:

  1. Is my database's response fast enough?
  2. Are too many joins causing a slowdown?

When you have fixed the least and the highest Normal Form acceptable in your case, pick the Normal Form anywhere in between.

可可 2024-07-20 11:47:09

如果您使用整数(或 BIGINT)作为 ID,并且它们是聚集主键,那么您应该没问题。

虽然从项目中找到办公室似乎总是更快,因为您总是查找主键,但在外键上使用索引将使差异最小,因为索引也将覆盖主键。

如果您稍后发现需要对数据进行非规范化,则可以按计划或触发器创建缓存表。

If you're using Integers (or BIGINT) as the ID's and they are the clustered primary key you should be fine.

Although it seems like it would always be faster to find an office from a project as you are always looking up primary keys the use of indexes on the foreign keys will make the difference minimal as the indexes will cover the primary keys too.

If you ever find a need later on to denormalise the data, you can create a cache table on a schedule or trigger.

遥远的绿洲 2024-07-20 11:47:09

在示例中,在表上正确设置的给定索引应该允许联接发生得非常快,并且可以很好地扩展到 100,000 行。 这通常是我用来解决这个问题的方法。

尽管有时数据被写入一次并在其余下的生命周期中被选择,但每次都进行十几个连接确实没有意义。

In the example given indexes set up properly on the tables should allow the joins to occur extremely fast and will scale well to the 100,000s of rows. This is usually the approach that I take to get around the issue.

There are times though that the data is written once and the selected for the rest of its life where it really didn't make sense to do a dozen joins each time.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文