When is it a good time to break the rules of normalization?

Posted on 2024-08-18 01:15:29

Please give your opinion of situations in which it is a good idea to NOT normalize. I just witnessed a heated discussion between an architect and a DBA, one of whom was arguing that the database was TOO normalized.

Comments (10)

誰認得朕 2024-08-25 01:15:29

The rule is normalize til it hurts, then denormalize til it works. (who said that?)

In general, I often denormalize when I have a lot of parent-child relationships and I know I would often have to join to five or six large tables to get one piece of data (say the client id, for instance) and will not need any of the information from the intermediate tables much of the time. If at all possible, I try to denormalize things that will not change frequently (such as id fields). But any time you denormalize, you have to write triggers or some other process (but normally triggers, if it isn't something that can be handled through a PK/FK relationship and cascading updates) to make sure the data stays in sync. If you fail to do this at the database level, then you will have data integrity problems and your data becomes useless. Do not think you can maintain the denormalization through the application code. That is a recipe for disaster, as databases are often updated from places other than the application.
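
A minimal sketch of the kind of database-level sync described above, assuming hypothetical client / orders / order_line tables and PostgreSQL trigger syntax; the names and dialect are illustrative, not from the answer.

    -- Hypothetical schema: client_id is copied down onto order_line so reports
    -- on lines don't have to join back up through orders to reach the client.
    CREATE TABLE client     (client_id int PRIMARY KEY, name text);
    CREATE TABLE orders     (order_id  int PRIMARY KEY,
                             client_id int NOT NULL REFERENCES client);
    CREATE TABLE order_line (line_id   int PRIMARY KEY,
                             order_id  int NOT NULL REFERENCES orders,
                             client_id int NOT NULL REFERENCES client, -- denormalized copy
                             qty       int NOT NULL);

    -- Keep the copy in sync at the database level, not in application code.
    CREATE FUNCTION sync_line_client() RETURNS trigger AS $$
    BEGIN
        SELECT o.client_id INTO NEW.client_id
        FROM   orders o
        WHERE  o.order_id = NEW.order_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER order_line_client_sync
    BEFORE INSERT OR UPDATE OF order_id ON order_line
    FOR EACH ROW EXECUTE FUNCTION sync_line_client();

    -- If orders.client_id can itself change, a companion trigger on orders
    -- would be needed to cascade the new value down to order_line.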

Denormalizing, even done correctly, can slow inserts, updates and deletes, especially if you need to do large batches of data. It may or may not improve select query speed depending on how you need to query the data. If you end up needing to do a lot of self-joins to get the data, it is possible you would have been better off not denormalizing. Never denormalize without testing to see if you have improved performance. Remember that slowing inserts/updates/deletes will have an overall effect on the system when many users are using it. By denormalizing to fix one problem, you may be introducing a worse problem in the overall system. Don't just test the one query you are trying to speed up; test the performance of the whole system. You might speed up a query that runs once a month and slow down other queries that run thousands of times a day.

Denormalizing is often done for data warehouses which are a special case as they are generally updated automatically on a schedule rather than one record at a time by a user. DBAs who specialize in data warehousing also tend to build them and they know how to avoid the data integrity issues.

Another common denormalizing technique is to create a staging table for data related to a complex report that doesn't need to be run against real-time data. This is a sort of poor man's data warehouse and should never be done without a way to update the staging table on a schedule (and as infrequently as you can get away with; this uses server resources that could be better spent elsewhere most of the time). Often these types of tables are updated when there are few users on the system and lag a full day behind the real-time data. Don't consider doing this unless the query you are staging the data for is truly slow and cannot otherwise be optimized. Many slow queries can be optimized without denormalization, as developers often use the easiest-to-understand rather than the most performant way to select data.
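
A rough sketch of that "poor man's data warehouse" pattern, assuming a hypothetical orders table with order_date and amount columns and PostgreSQL syntax; the scheduler itself (cron, SQL Agent, pg_cron, etc.) sits outside the database.

    -- Pre-aggregated staging table for a slow monthly report, rebuilt
    -- off-hours by a scheduled job instead of hitting live data.
    CREATE TABLE sales_report_staging (
        report_month date        NOT NULL,
        client_id    int         NOT NULL,
        total_orders int         NOT NULL,
        total_amount numeric     NOT NULL,
        refreshed_at timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (report_month, client_id)
    );

    -- Procedure the scheduler calls nightly, when few users are on the system.
    CREATE PROCEDURE refresh_sales_report_staging()
    LANGUAGE plpgsql AS $$
    BEGIN
        TRUNCATE sales_report_staging;
        INSERT INTO sales_report_staging (report_month, client_id, total_orders, total_amount)
        SELECT date_trunc('month', o.order_date)::date, o.client_id, count(*), sum(o.amount)
        FROM   orders o
        GROUP  BY 1, 2;
    END;
    $$;

    -- The report then reads the staging table and accepts being up to a day stale.
    -- CALL refresh_sales_report_staging();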

泼猴你往哪里跑 2024-08-25 01:15:29

Performance and query possibilities: If the DB is too normalized, this may result in a lot of joins in your queries, and limits your possibilities to search on specific attributes.
When doing DB design, you should consider the way you plan to use it, by doing access path analysis.

To elaborate, it's a rule of thumb to normalize data that is updated frequently, and denormalize data that is mostly read.
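
As an illustration of the read-heavy case (hypothetical tables, not from the answer): a fully normalized design forces a chain of joins on every read, while a denormalized copy of a rarely-changing attribute turns it into a single-table read.

    -- Normalized read path: several joins just to show a region name per order line.
    SELECT ol.line_id, r.region_name
    FROM   order_line ol
    JOIN   orders  o ON o.order_id   = ol.order_id
    JOIN   client  c ON c.client_id  = o.client_id
    JOIN   address a ON a.address_id = c.address_id
    JOIN   region  r ON r.region_id  = a.region_id;

    -- The same read after denormalizing region_name onto order_line,
    -- acceptable because the data is read far more often than it is updated.
    SELECT line_id, region_name
    FROM   order_line;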

久而酒知 2024-08-25 01:15:29

When you're optimizing prematurely

Some normalization is there to allow future growth, and you may not need it.

For example, imagine you have a person table. You can have birthday as a column, because each person will only ever have one birthday.

If you're strictly normalizing, you wouldn't have phone_number and cell_number and fax_number as columns in person, but might instead have a phonenumber table where each row has a number, a type, and a person_id relation. This can be better than just sticking new columns in the person table, because

  1. Many people will not have all of those, leaving a lot of blank spaces, and
  2. If someone has 3 cell numbers, you'd be adding awkward columns like cell_number_2

Concern #1 is valid, but concern #2 may be an example of "You ain't gonna need it." It's valid to say "we only allow one cell number, and that's that." In that case, you might not bother making a separate table for phone numbers.

It's a tradeoff. By not making the separate table, you're not strictly normalizing, and you might have a lot of NULL spaces. But you also have fewer joins to do, and it's just less work.
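
A sketch of the two designs being compared, using the person / phonenumber naming from the answer; the column types and the extra name column are assumptions.

    -- Strictly normalized: any number of phone numbers, each with a type.
    CREATE TABLE person (
        person_id int PRIMARY KEY,
        name      text,
        birthday  date
    );
    CREATE TABLE phonenumber (
        phonenumber_id int PRIMARY KEY,
        person_id      int  NOT NULL REFERENCES person,
        number         text NOT NULL,
        type           text NOT NULL   -- 'phone', 'cell', 'fax', ...
    );

    -- Pragmatic version when "one of each, and that's that" really holds:
    -- fewer joins and less work, at the cost of NULLs for missing numbers.
    CREATE TABLE person_flat (
        person_id    int PRIMARY KEY,
        name         text,
        birthday     date,
        phone_number text,
        cell_number  text,
        fax_number   text
    );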

Like many good practices, normalization can become an end in itself - a gold badge you give yourself privately, because you Did It Right. And that's fine. But it's good to realize that the rules can be bent sometimes in the interest of keeping things simple.

One last thing: you have to weigh the fact that it sucks to change your db schema once your code is up and running. So it's OK to say "we ain't gonna need it," but try to be pretty sure before you commit.

春夜浅 2024-08-25 01:15:29

There are several good answers already on storage vs. performance, but I'd add that one further indication, for me, that denormalization should be considered is where you're going to require queries using self-joins.

Conceptually of course there's nothing wrong with self-joining tables, but experience shows that it's one of the more difficult concepts for inexperienced programmers to grasp and consequently tends to produce bugs. You'll most likely ease the future maintenance path if you can design the need for these out.
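
One hypothetical shape this takes (not from the answer): a hierarchy stored in a single table, where even a simple listing needs a self-join with aliases and NULL handling, which is exactly where inexperienced programmers tend to slip.

    -- Each employee row points at its manager in the same table.
    CREATE TABLE employee (
        employee_id int PRIMARY KEY,
        name        text NOT NULL,
        manager_id  int REFERENCES employee (employee_id)
    );

    -- The self-join: two aliases over one table, LEFT JOIN so the person
    -- at the top of the hierarchy (manager_id IS NULL) still appears.
    SELECT e.name AS employee,
           m.name AS manager
    FROM   employee e
    LEFT JOIN employee m ON m.employee_id = e.manager_id;

    -- A denormalized alternative that designs the self-join away:
    -- ALTER TABLE employee ADD COLUMN manager_name text;  -- kept in sync by a trigger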

It's a matter of judgement of course and so is an indication, not a rule.

子栖 2024-08-25 01:15:29

You have to find the sweet spot... Being too normalized, you end up with a lot of "bloaty" abstract structures which contain just 1 or 2 columns of data and you end up joining 5 tables for most queries.

Being under-normalized, you end up having a lot of data in a lot of different places. This can cause the DB to be slower because of cache size and such. Also, now when you need to update something, you have 4 different tables to update instead of 1, and don't even get me started on making sure all the data in different places matches up!

Basically, pick your poison, watch how your DB will be used, and be sane about it. All rules were meant to be broken, and if you have a piece of data in maybe 2 places that is very commonly accessed, I would say that's OK, if it's a critical piece where joining (possibly more than 2) tables would just be too expensive. But also don't micro-optimize for space or speed of the database.
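
As a concrete illustration of the "piece of data in 2 places" case (hypothetical client and orders tables, not from the answer): a client name copied onto a heavily read orders table saves the join, and the price is that every rename has to touch both copies, ideally in one transaction or via a trigger so they never drift.

    -- A copy of the client's name lives on orders because it is read constantly.
    ALTER TABLE orders ADD COLUMN client_name text;

    -- Updating the "one fact" now means updating two places atomically.
    BEGIN;
    UPDATE client SET name        = 'Acme Ltd' WHERE client_id = 42;
    UPDATE orders SET client_name = 'Acme Ltd' WHERE client_id = 42;
    COMMIT;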

任性一次 2024-08-25 01:15:29

I worked at a place where they normalized too much. They removed the 'state' column from a mailing address table. In place of a 2-byte state column they put an integer foreign key field linking to a state table.

In summary:

  • They replaced a 2-byte state column with a 4-byte column in the address table. Now every row takes 2 more bytes of storage.

  • They added a state table with a 4-byte primary key column and a 2-byte state column. Storing this table takes up more space.

  • The database keeps a btree index of the keys in the state table. That takes up more space.

  • The SQL to retrieve addresses is harder to write.

  • The SQL to retrieve addresses is slower than the original.

Sure, this naively removes duplicated unchanging data. The result is that it uses more disk space, is harder to use, and is slower to use.
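
A reconstruction of the two designs this answer contrasts; the exact column names and types are assumed.

    -- Original: the 2-byte state code stored inline with the address.
    CREATE TABLE mailing_address (
        address_id int PRIMARY KEY,
        street     text,
        city       text,
        state      char(2)            -- 'NY', 'CA', ...
    );

    -- The over-normalized rework: a surrogate-keyed lookup table plus a
    -- 4-byte foreign key on every address row, and a btree index on the lookup.
    CREATE TABLE state (
        state_id   int     PRIMARY KEY,
        state_code char(2) NOT NULL
    );
    CREATE TABLE mailing_address_v2 (
        address_id int PRIMARY KEY,
        street     text,
        city       text,
        state_id   int NOT NULL REFERENCES state
    );

    -- Every address read now pays for a join it never needed before.
    SELECT a.street, a.city, s.state_code
    FROM   mailing_address_v2 a
    JOIN   state s ON s.state_id = a.state_id;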

You definitely can normalize too much.

一枫情书 2024-08-25 01:15:29

Normalization eliminates redundancy, but if it slows down performance (due to all the required joins) to the point where throwing more hardware at it is not an adequate fix, it is time to allow redundancy for performance's sake. That's my rule of thumb. The same applies when response times get too long.

蓝眼泪 2024-08-25 01:15:29

Data warehousing often uses a non-normalized approach for performance reasons. Per Wikipedia:

A standard piece of database design guidance is that the designer should create a fully normalized design; selective denormalization can subsequently be performed for performance reasons. However, some modeling disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF.

ぃ双果 2024-08-25 01:15:29

Reporting and data warehousing are probably the biggest places you'll find data denormalized. OLAP systems typically denormalize into a single table or a set of tables.

帅气尐潴 2024-08-25 01:15:29

Don't normalize when you are following a design discipline that results in a less than fully normalized schema. One such design discipline is star schema design, and a close variant of it is snowflake schema.

Both star and snowflake will result in a schema that's far easier to use for a variety of reports, customized extracts, and interface to an OLAP tool such as Cognos Power Play. The downside? Every departure from one of the normal forms (except 1NF) carries with it an anomaly when inserting/updating/deleting data. If you really know the normal forms, you'll know what the related anomalies are. When you write the ETL (extract, transform, and load) procedures to keep your star/snowflake current, you'll have to work around these anomalies.

So when is a star or snowflake schema a better idea than a normalized schema? Typically, for data warehouses, data marts, and reporting databases. In my own practice, I've never built one of these that was not a back end for an OLTP database, and the OLTP database benefits from nearly complete normalization. Don't just denormalize and abandon all discipline. That's like designing at random.
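
A minimal star-schema sketch of the kind this answer has in mind; the table and column names are illustrative, and an ETL job would load these from the normalized OLTP side.

    -- One fact table surrounded by wide, deliberately denormalized dimensions.
    CREATE TABLE dim_date (
        date_key   int  PRIMARY KEY,   -- e.g. 20240818
        full_date  date NOT NULL,
        month_name text NOT NULL,
        year       int  NOT NULL
    );
    CREATE TABLE dim_customer (
        customer_key int PRIMARY KEY,
        name         text,
        city         text,
        state        text              -- flattened on purpose, not 3NF
    );
    CREATE TABLE fact_sales (
        date_key     int     NOT NULL REFERENCES dim_date,
        customer_key int     NOT NULL REFERENCES dim_customer,
        quantity     int     NOT NULL,
        amount       numeric NOT NULL
    );

    -- Reporting becomes a simple join from the fact table to a few dimensions,
    -- which is the shape OLAP and reporting tools expect.
    SELECT d.year, d.month_name, sum(f.amount) AS revenue
    FROM   fact_sales f
    JOIN   dim_date d ON d.date_key = f.date_key
    GROUP  BY d.year, d.month_name;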
