When it comes to databases, is "normalize for correctness, denormalize for performance" the right mantra?

Published 2024-07-08 14:59:45 · 1,451 characters · 13 views · 0 comments


Comments (14)

听风吹 2024-07-15 14:59:46

Don't forget that each time you denormalize part of your database, your capacity to further adapt it decreases, as the risk of bugs in code increases, making the whole system less and less sustainable.

南风几经秋 2024-07-15 14:59:46

Normalization has nothing to do with performance. I can't really put it better than Erwin Smout did in this thread:
What is the resource impact from normalizing a database?

Most SQL DBMSs have limited support for changing the physical representation of data without also compromising the logical model, so unfortunately that's one reason why you may find it necessary to denormalize. Another is that many DBMSs don't have good support for multi-table integrity constraints, so as a workaround to implement those constraints you may be forced to put extraneous attributes into some tables.
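
As a concrete sketch of that second point (hypothetical tables, generic SQL): suppose the rule is that an order line's currency must match its order's currency. The rule spans two tables, so a common workaround is to carry the parent's attribute redundantly in the child and tie it back with a composite foreign key.

    -- Hypothetical example: the cross-table rule cannot be expressed directly
    -- in most DBMSs, so the parent attribute is duplicated into the child
    -- and anchored with a composite foreign key.
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        currency  CHAR(3) NOT NULL,
        UNIQUE (order_id, currency)      -- extra key only to support the FK below
    );

    CREATE TABLE order_lines (
        order_id  INTEGER NOT NULL,
        line_no   INTEGER NOT NULL,
        currency  CHAR(3) NOT NULL,      -- the "extraneous" (redundant) attribute
        amount    DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (order_id, line_no),
        FOREIGN KEY (order_id, currency) REFERENCES orders (order_id, currency)
    );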

晌融 2024-07-15 14:59:46

Database normalization isn't just for theoretical correctness, it can help to prevent data corruption. I certainly would NOT denormalize for "simplicity" as @aSkywalker suggests. Fixing and cleaning corrupted data is anything but simple.

花落人断肠 2024-07-15 14:59:46

You don't normalize for 'correctness' per se. Here is the thing:

Denormalized tables have the benefit of increased performance, but require redundancy and more developer brain power.

Normalized tables have the benefit of reduced redundancy and greater ease of development, but at a cost in performance.

It's almost like a classic balanced equation. So depending on your needs (such as how many users are hammering your database server), you should stick with normalized tables unless it is really needed. It is, however, easier and less costly for development to go from normalized to denormalized than vice versa.
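
A rough illustration of that last point (hypothetical tables, and assuming a DBMS that supports CREATE TABLE ... AS SELECT): flattening a normalized model is a single join, while re-normalizing later means hunting down copies that may have drifted apart.

    -- Normalized -> denormalized: one join that flattens customer data onto orders.
    CREATE TABLE report_orders AS
    SELECT o.order_id, o.ordered_at, c.customer_id, c.name, c.city
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id;

    -- Denormalized -> normalized: first find customers whose copied name has
    -- drifted across rows; every hit is a conflict someone must reconcile.
    SELECT customer_id, COUNT(DISTINCT name) AS name_versions
    FROM   report_orders
    GROUP  BY customer_id
    HAVING COUNT(DISTINCT name) > 1;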

强者自强 2024-07-15 14:59:46

No way. Keep in mind that what you're supposed to be normalizing is your relations (logical level), not your tables (physical level).

寻梦旅人 2024-07-15 14:59:46

Denormalized data is much more often found at places where not enough normalization was done.

My mantra is 'normalize for correctness, eliminate for performance'. RDBMSs are very flexible tools, but optimized for the OLTP situation. Replacing the RDBMS with something simpler (e.g. objects in memory with a transaction log) can help a lot.

他夏了夏天 2024-07-15 14:59:46

I take issue with the assertion by folks here that normalized databases are always associated with simpler, cleaner, more robust code. It is certainly true that there are many cases where fully normalized designs are associated with simpler code than partially denormalized designs, but at best this is a guideline, not a law of physics.

Someone once defined a word as the skin of a living idea. In CS, you could say an object or table is defined in terms of the needs of the problem and the existing infrastructure, instead of being a platonic reflection of an ideal object. In theory, there will be no difference between theory and practice, but in practice, you do find variations from theory. This saying is particularly interesting for CS because one of the focuses of the field is to find these differences and to handle them in the best way possible.

Taking a break from the DB side of things and looking at the coding side of things, object-oriented programming has saved us from a lot of the evils of spaghetti coding, by grouping a lot of closely related code together under an object-class name that has an English meaning that is easy to remember and that somehow fits with all of the code it is associated with. If too much information is clustered together, then you end up with large amounts of complexity within each object, and it is reminiscent of spaghetti code. If you make the clusters too small, then you can't follow threads of logic without searching through large numbers of objects with very little information in each object, which has been referred to as "macaroni code".

If you look at the trade-off between the ideal object size on the programming side of things and the object size that results from normalizing your database, I will give a nod to those who would say it is often better to choose based on the database and then work around that choice in code. Especially because you have the ability in some cases to create objects from joins with Hibernate and technologies like that. However, I would stop far short of saying this is an absolute rule. Any OR-mapping layer is written with the idea of making the most complex cases simpler, possibly at the expense of adding complexity to the most simple cases. And remember that complexity is not measured in units of size, but rather in units of complexity. There are all sorts of different systems out there. Some are expected to grow to a few thousand lines of code and stay there forever. Others are meant to be the central portal to a company's data and could theoretically grow in any direction without constraint. Some applications manage data that is read millions of times for every update. Others manage data that is only read for audit and ad-hoc purposes. In general the rules are:

  • Normalization is almost always a good idea in medium-sized apps or larger when data on both sides of the split can be modified and the potential modifications are independent of each other.

  • Updating or selecting from a single table is generally simpler than working with multiple tables; however, with a well-written OR layer, this difference can be minimized for a large part of the data model space. Working with straight SQL, this is almost trivial to work around for an individual use case, albeit in a non-object-oriented way.

  • Code needs to be kept relatively small to be manageable, and one effective way to do this is to divide the data model and build a service-oriented architecture around the various pieces of the data model. The goal of an optimal state of data (de)normalization should be thought of within the paradigm of your overall complexity management strategy.

In complex object hierarchies there are complexities that you don't see on the database side, like the cascading of updates. If you model relational foreign keys and crosslinks with an object ownership relationship, then when updating the object, you have to decide whether to cascade the update or not. This can be more complex than it would be in SQL because of the difference between doing something once and doing something correctly always, sort of like the difference between loading a data file and writing a parser for that type of file. The code that cascades an update or delete in C++, Java, or whatever will need to make the decision correctly for a variety of different scenarios, and the consequences of mistakes in this logic can be pretty serious. It remains to be proven that this could not be simplified, with a bit of flexibility on the SQL side, enough to make the extra SQL complexity worthwhile.
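
For contrast, a minimal sketch of the declarative side (hypothetical tables): in SQL the cascade decision is stated once on the foreign key and applied uniformly, rather than re-implemented for every code path that touches the object graph.

    CREATE TABLE departments (
        dept_id INTEGER PRIMARY KEY
    );

    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        dept_id INTEGER NOT NULL
                REFERENCES departments (dept_id)
                ON UPDATE CASCADE     -- key changes propagate automatically
                ON DELETE RESTRICT    -- deleting a non-empty department is refused
    );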

There is also a point deserving delineation with one of the normalization precepts. A central argument for normalization in databases is the idea that data duplication is always bad. This is frequently true, but cannot be followed slavishly, especially when there are different owners for the different pieces of a solution. I saw a situation once in which one group of developers managed a certain type of transactions, and another group of developers supported auditability of these transactions, so the second group of developers wrote a service which scraped several tables whenever a transaction occurred and created a denormalized snapshot record stating, in effect, what the state of the system was at the time of the transaction. This scenario stands as an interesting use case (for the data duplication part of the question at least), but it is actually part of a larger category of issues. Data consistency desires will often put certain constraints on the structure of data in the database that can make error handling and troubleshooting simpler by making some of the incorrect cases impossible. However, this can also have the impact of "freezing" portions of data, because changing that subset of the data would cause past transactions to become invalid under the consistency rules. Obviously some sort of versioning system is required to sort this out, so the obvious question is whether to use a normalized versioning system (effective and expiration times) or a snapshot-based approach (value as of transaction time). There are several internal structure questions for the normalized version that you don't have to worry about with the snapshot approach, like:

  • Can date range queries be done efficiently even for large tables?
  • Is it possible to guarantee non-overlap of date ranges?
  • Is it possible to trace status events back to operator, transaction, or reason for change? (probably yes, but this is additional overhead)
  • By creating a more complicated versioning system, are you putting the right owners in charge of the right data?
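
A minimal sketch of the normalized-versioning option, using PostgreSQL-specific range types and an exclusion constraint (hypothetical table; on DBMSs without this feature the non-overlap guarantee has to live in triggers or application code):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE customer_address_versions (
        customer_id INTEGER   NOT NULL,
        address     TEXT      NOT NULL,
        valid       tstzrange NOT NULL,   -- effective .. expiration
        -- declaratively forbids overlapping validity ranges per customer
        EXCLUDE USING gist (customer_id WITH =, valid WITH &&)
    );

    -- "As of" lookup; the constraint's GiST index keeps this efficient on large tables.
    SELECT address
    FROM   customer_address_versions
    WHERE  customer_id = 42
    AND    valid @> TIMESTAMPTZ '2024-01-15';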

I think the optimal goal here is to learn not only what is correct in theory, but why it is correct, and what are the consequences of violations, then when you are in the real world, you can decide which consequences are worth taking to gain which other benefits. That is the real challenge of design.

夜雨飘雪 2024-07-15 14:59:46

Reporting systems and transaction systems have different requirements.

For a transaction system, I would recommend always using normalization for data correctness.

For a reporting system, use normalization unless denormalization is required for whatever reason, such as ease of ad-hoc querying, performance, etc.

木緿 2024-07-15 14:59:46

Simplicity? Not sure if Steven is gonna swat me with his newspaper, but where I hang, sometimes the denormalized tables help the reporting/readonly guys get their jobs done without bugging the database/developers all the time...

腻橙味 2024-07-15 14:59:45

The two most common reasons to denormalize are:

  1. Performance
  2. Ignorance

The former should be verified with profiling, while the latter should be corrected with a rolled-up newspaper ;-)

I would say a better mantra would be "normalize for correctness, denormalize for speed - and only when necessary"

无力看清 2024-07-15 14:59:45

To fully understand the import of the original question, you have to understand something about team dynamics in systems development, and the kind of behavior (or misbehavior) different roles / kinds of people are predisposed to. Normalization is important because it isn't just a dispassionate debate of design patterns -- it also has a lot to do with how systems are designed and managed over time.

Database people are trained that data integrity is a paramount issue. We like to think in terms of 100% correctness of data, so that once data is in the DB, you don't have to think about or deal with it ever being logically wrong. This school of thought places a high value on normalization, because it causes (forces) a team to come to grips with the underlying logic of the data & system. To consider a trivial example -- does a customer have just one name & address, or could he have several? Someone needs to decide, and the system will come to depend on that rule being applied consistently. That sounds like a simple issue, but multiply that issue by 500x as you design a reasonably complex system and you will see the problem -- rules can't just exist on paper, they have to be actively enforced. A well-normalized database design (with the additional help of uniqueness constraints, foreign keys, check values, logic-enforcing triggers etc.) can help you have a well-defined core data model and data-correctness rules, which is really important if you want the system to work as expected when many people work on different parts of the system (different apps, reports, whatever) and different people work on the system over time. Or to put it another way -- if you don't have some way to define and operationally enforce a solid core data model, your system will suck.
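
As a small, hypothetical illustration of that "one address or several?" decision being enforced by the schema rather than left on paper:

    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(200) NOT NULL
    );

    CREATE TABLE customer_addresses (
        customer_id  INTEGER NOT NULL REFERENCES customers (customer_id),
        address_type VARCHAR(10) NOT NULL
                     CHECK (address_type IN ('billing', 'shipping')),
        address      VARCHAR(500) NOT NULL,
        -- the decided rule: at most one address of each type per customer
        PRIMARY KEY (customer_id, address_type)
    );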

Other people (often, less experienced developers) don't see it this way. They see the database as at best a tool that's enslaved to the application they're developing, or at worst a bureaucracy to be avoided. (Note I'm saying "less experienced" developers. A good developer will have the same awareness of the need for a solid data model and data correctness as a database person. They might differ on what's the best way to achieve that, but in my experience are reasonably open to doing those things in a DB layer as long as the DB team knows what they're doing and can be responsive to the developers). These less experienced folks are often the ones who push for denormalization, as more or less an excuse for doing a quick & dirty job of designing and managing the data model. This is how you end up getting database tables that are 1:1 with application screens and reports, each reflecting a different developer's design assumptions, and a complete lack of sanity / coherence between the tables. I've experienced this several times in my career. It is a disheartening and deeply unproductive way to develop a system.

So one reason people have a strong feeling about normalization is the issue is a stand-in for other issues they feel strongly about. If you are sucked into a debate about normalization, think about the underlying (non-technical) motivation the parties may be bringing to the debate.

Having said that, here's a more direct answer to the original question :)

It is useful to think of your database as consisting of a core design that is as close as possible to a logical design -- highly normalized and constrained -- and an extended design that addresses other issues like stable application interfaces and performance.

You should want to constrain and normalize your core data model, because to not do that compromises the fundamental integrity of the data and all the rules / assumptions your system is being built upon. If you let those issues get away from you, your system can get crappy pretty fast. Test your core data model against requirements and real-world data, and iterate like mad until it works. This step will feel a lot more like clarifying requirements than building a solution, and it should. Use the core data model as a forcing function to get clear answers on these design issues for everyone involved.

Complete your core data model before moving on to the extended data model. Use it and see how far you can get with it. Depending on the amount of data, number of users and patterns of use, you may never need an extended data model. See how far you can get with indexing plus the 1,001 performance-related knobs you can turn in your DBMS.

If you truly tap out the performance-management capabilities of your DBMS, then you need to look at extending your data model in a way that adds denormalization. Note this is not about denormalizing your core data model, but rather adding new resources that handle the denorm data. For example, if there are a few huge queries that crush your performance, you might want to add a few tables that precompute the data those queries would produce -- essentially pre-executing the query. It is important to do this in a way that maintains the coherence of the denormalized data with the core (normalized) data. For example, in DBMSs that support them, you can use a MATERIALIZED VIEW to make the maintenance of the denorm data automatic. If your DBMS doesn't have this option, then maybe you can do it by creating triggers on the tables where the underlying data exists.
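
A minimal sketch of that approach with hypothetical tables, using materialized-view syntax as found in PostgreSQL and Oracle:

    -- The denormalized numbers live outside the core tables and are derived
    -- from them, so the normalized data remains the single source of truth.
    CREATE MATERIALIZED VIEW daily_sales_totals AS
    SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM   orders
    GROUP  BY order_date;

    -- Refresh on whatever schedule keeps staleness acceptable
    -- (a scheduled job, after a batch load, etc.).
    REFRESH MATERIALIZED VIEW daily_sales_totals;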

There is a world of difference between selectively denormalizing a database in a coherent manner to deal with a realistic performance challenge vs. just having a weak data design and using performance as a justification for it.

When I work with low-to-medium experienced database people and developers, I insist they produce an absolutely normalized design ... then later may involve a small number of more experienced people in a discussion of selective denormalization. Denormalization is more or less always bad in your core data model. Outside the core, there is nothing at all wrong with denormalization if you do it in a considered and coherent way.

In other words, denormalizing from a normal design to one that preserves the normal while adding some denormal -- that deals with the physical reality of your data while preserving its essential logic -- is fine. Designs that don't have a core of normal design -- that shouldn't even be called de-normalized, because they were never normalized in the first place, because they were never consciously designed in a disciplined way -- are not fine.

Don't accept the terminology that a weak, undisciplined design is a "denormalized" design. I believe the confusion between intentionally / carefully denormalized data vs. plain old crappy database design that results in denormal data because the designer was a careless idiot is the root cause of many of the debates about denormalization.

甜扑 2024-07-15 14:59:45

Denormalization normally means some improvement in retrieval efficiency (otherwise, why do it at all), but at a huge cost in complexity of validating the data during modify (insert, update, sometimes even delete) operations. Most often, the extra complexity is ignored (because it is too damned hard to describe), leading to bogus data in the database, which is often not detected until later - such as when someone is trying to work out why the company went bankrupt and it turns out that the data was self-inconsistent because it was denormalized.

I think the mantra should go "normalize for correctness, denormalize only when senior management offers to give your job to someone else", at which point you should accept the opportunity to go to pastures new since the current job may not survive as long as you'd like.

Or "denormalize only when management sends you an email that exonerates you for the mess that will be created".

Of course, this assumes that you are confident of your abilities and value to the company.

沉默的熊 2024-07-15 14:59:45

Mantras almost always oversimplify their subject matter. This is a case in point.

The advantages of normalizing are more than merely theoretical or aesthetic. For every departure from a normal form for 2NF and beyond, there is an update anomaly that occurs when you don't follow the normal form and that goes away when you do follow the normal form. Departure from 1NF is a whole different can of worms, and I'm not going to deal with it here.

These update anomalies generally fall into inserting new data, updating existing data, and deleting rows. You can generally work your way around these anomalies by clever, tricky programming. The question then is was the benefit of using clever, tricky programming worth the cost. Sometimes the cost is bugs. Sometimes the cost is loss of adaptability. Sometimes the cost is actually, believe it or not, bad performance.
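
A small, hypothetical example of such an update anomaly and the "clever, tricky programming" it demands: in a table that repeats the customer's city on every order row, a careless update leaves the copies disagreeing.

    -- Bug: only one of the customer's rows is touched; the other orders
    -- still carry the old city, and the data is now self-inconsistent.
    UPDATE orders_denorm
    SET    customer_city = 'Berlin'
    WHERE  customer_id = 42
    AND    order_id = 1001;

    -- The workaround is to remember, on every code path, to update all copies.
    UPDATE orders_denorm
    SET    customer_city = 'Berlin'
    WHERE  customer_id = 42;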

If you learn the various normal forms, you should consider your learning incomplete until you understand the accompanying update anomaly.

The problem with "denormalize" as a guideline is that it doesn't tell you what to do. There are myriad ways to denormalize a database. Most of them are unfortunate, and that's putting it charitably. One of the dumbest ways is to simply denormalize one step at a time, every time you want to speed up some particular query. You end up with a crazy mish mosh that cannot be understood without knowing the history of the application.

A lot of denormalizing steps that "seemed like a good idea at the time" turn out later to be very bad moves.

Here's a better alternative, when you decide not to fully normalize: adopt some design discipline that yields certain benefits, even when that design discipline departs from full normalization. As an example, there is star schema design, widely used in data warehousing and data marts. This is a far more coherent and disciplined approach than merely denormalizing by whimsy. There are specific benefits you'll get out of a star schema design, and you can contrast them with the update anomalies you will suffer because star schema design contradicts normalized design.

In general, many people who design star schemas are building a secondary database, one that does not interact with the OLTP application programs. One of the hardest problems in keeping such a database current is the so called ETL (Extract, Transform, and Load) processing. The good news is that all this processing can be collected in a handful of programs, and the application programmers who deal with the normalized OLTP database don't have to learn this stuff. There are tools out there to help with ETL, and copying data from a normalized OLTP database to a star schema data mart or warehouse is a well understood case.
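
A hypothetical fragment of such an ETL load step, with the denormalizing joins concentrated in one place instead of scattered through the OLTP application:

    INSERT INTO fact_sales (date_key, product_key, store_key, quantity, amount)
    SELECT d.date_key, p.product_key, s.store_key, ol.quantity, ol.amount
    FROM   order_lines ol
    JOIN   dim_date    d ON d.calendar_date = ol.order_date
    JOIN   dim_product p ON p.product_id    = ol.product_id
    JOIN   dim_store   s ON s.store_id      = ol.store_id;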

Once you have built a star schema, and if you have chosen your dimensions well, named your columns wisely, and especially chosen your granularity well, using this star schema with OLAP tools like Cognos or Business Objects turns out to be almost as easy as playing a video game. This permits your data analysts to focus on analysing the data instead of learning how the container of the data works.

There are other designs besides star schema that depart from normalization, but star schema is worth a special mention.

深海少女心 2024-07-15 14:59:45

Data warehouses in a dimensional model are often modelled in a (denormalized) star schema. These kinds of schemas are not (normally) used for online production or transactional systems.

The underlying reason is performance, but the fact/dimensional model also allows for a number of temporal features like slowly changing dimensions which are doable in traditional ER-style models, but can be incredibly complex and slow (effective dates, archive tables, active records, etc).
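
A minimal, hypothetical sketch of the Type 2 "slowly changing dimension" idiom mentioned above: instead of overwriting a customer's attributes, the current row is closed out and a new current row is added.

    UPDATE dim_customer
    SET    valid_to = CURRENT_DATE, is_current = 'N'
    WHERE  customer_id = 42 AND is_current = 'Y';

    INSERT INTO dim_customer (customer_id, name, city, valid_from, valid_to, is_current)
    VALUES (42, 'Ada Lovelace', 'Berlin', CURRENT_DATE, DATE '9999-12-31', 'Y');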
