Should OLAP databases be denormalized for read performance?

萌逼全场 2024-10-14 05:08:38

Mythology

I always thought that databases should be denormalized for reading, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design.

There's a myth to that effect. In the Relational Database context, I have re-implemented six very large so-called "de-normalised" "databases"; and executed over eighty assignments correcting problems on others, simply by Normalising them, applying Standards and engineering principles. I have never seen any evidence for the myth. Only people repeating the mantra as if it were some sort of magical prayer.

Normalisation vs Un-normalised

("De-normalisation" is a fraudulent term I refuse to use it.)

This is a scientific industry (at least the bit that delivers software that does not break; that put people on the Moon; that runs banking systems; etc). It is governed by the laws of physics, not magic. Computers and software are all finite, tangible, physical objects that are subject to the laws of physics. According to the secondary and tertiary education I received:

  • it is not possible for a bigger, fatter, less organised object to perform better than a smaller, thinner, more organised object.

  • Normalisation yields more tables, yes, but each table is much smaller. And even though there are more tables, there are in fact (a) fewer joins and (b) the joins are faster because the sets are smaller. Fewer Indices are required overall, because each smaller table needs fewer indices. Normalised tables also yield much shorter row sizes.

  • for any given set of resources, Normalised tables:

    • fit more rows into the same page size
    • therefore fit more rows into the same cache space, therefore overall throughput is increased
    • therefore fit more rows into the same disk space, therefore the number of I/Os is reduced; and when I/O is called for, each I/O is more efficient.
  • it is not possible for an object that is heavily duplicated to perform better than an object that is stored as a single version of the truth. Eg. when I removed the 5 x duplication at the table and column level, all the transactions were reduced in size; the locking reduced; the Update Anomalies disappeared. That substantially reduced contention and therefore increased concurrent use.

The overall result was therefore much, much higher performance.

In my experience, which is delivering both OLTP and OLAP from the same database, there has never been a need to "de-normalise" my Normalised structures, to obtain higher speed for read-only (OLAP) queries. That is a myth as well.

  • No, the "de-normalisation" requested by others reduced speed, and it was eliminated. No surprise to me, but again, the requesters were surprised.

Many books have been written by people, selling the myth. It needs to be recognised that these are non-technical people; since they are selling magic, the magic they sell has no scientific basis, and they conveniently avoid the laws of physics in their sales pitch.

(For anyone who wishes to dispute the above physical science, merely repeating the mantra will not have any effect; please supply specific evidence supporting the mantra.)

Why is the Myth Prevalent ?

Well, first, it is not prevalent among scientific types, who do not seek ways of overcoming the laws of physics.

From my experience, I have identified three major reasons for the prevalence:

  1. For those people who cannot Normalise their data, it is a convenient justification for not doing so. They can refer to the magic book and without any evidence for the magic, they can reverently say "see a famous writer validates what I have done". Not Done, most accurately.

  2. Many SQL coders can write only simple, single-level SQL. Normalised structures require a bit of SQL capability. If they do not have that; if they cannot produce SELECTs without using temporary tables; if they cannot write Sub-queries, they will be psychologically glued to the hip to flat files (which is what "de-normalised" structures are), which they can process.

  3. People love to read books, and to discuss theories. Without experience. Especially re magic. It is a tonic, a substitute for actual experience. Anyone who has actually Normalised a database correctly has never stated that "de-normalised is faster than normalised". To anyone stating the mantra, I simply say "show me the evidence", and they have never produced any. So the reality is, people repeat the mythology for these reasons, without any experience of Normalisation. We are herd animals, and the unknown is one of our biggest fears.

    That is why I always include "advanced" SQL and mentoring on any project.

My Answer

This Answer is going to be ridiculously long if I answer every part of your question or if I respond to the incorrect elements in some of the other answers. Eg. the above has answered just one item. Therefore I will answer your question in total without addressing the specific components, and take a different approach. I will deal only in the science related to your question, that I am qualified in, and very experienced with.

Let me present the science to you in manageable segments.
Typical First Generation "databases"
The typical model of the six large scale full implementation assignments.

  • These were the closed "databases" commonly found in small firms, and the organisations were large banks
  • very nice for a first generation, get-the-app-running mindset, but a complete failure in terms of performance, integrity and quality
  • they were designed for each app, separately
  • reporting was not possible, they could only report via each app
  • since "de-normalised" is a myth, the accurate technical definition is, they were un-normalised
    • In order to "de-normalise" one must Normalise first; then reverse the process a little
      in every instance where people showed me their "de-normalised" data models, the simple fact was, they had not Normalised at all; so "de-normalisation" was not possible; it was simply un-normalised
  • since they did not have much Relational technology, or the structures and control of Databases, but they were passed off as "databases", I have placed those words in quotation marks
  • as is scientifically guaranteed for un-normalised structures, they suffered multiple versions of the truth (data duplication) and therefore high contention and low concurrency, within each of them
  • they had an additional problem of data duplication across the "databases"
  • the organisation was trying to keep all those duplicates synchronised, so they implemented replication; which of course meant an additional server; ETL and synching scripts to be developed; and maintained; etc
  • needless to say, the synching was never quite enough and they were forever changing it
  • with all that contention and low throughput, it was no problem at all justifying a separate server for each "database". It did not help much.

So we contemplated the laws of physics, and we applied a little science.
5NF Corporate Database
We implemented the Standard concept that the data belongs to the corporation (not the departments) and the corporation wanted one version of the truth. The Database was pure Relational, Normalised to 5NF. Pure Open Architecture, so that any app or report tool could access it. All transactions in stored procs (as opposed to uncontrolled strings of SQL all over the network). The same developers for each app coded the new apps, after our "advanced" education.

Evidently the science worked. Well, it wasn't my private science or magic, it was ordinary engineering and the laws of physics. All of it ran on one database server platform; two pairs (production & DR) of servers were decommissioned and given to another department. The 5 "databases" totalling 720GB were Normalised into one Database totalling 450GB. About 700 tables (many duplicates and duplicated columns) were normalised into 500 unduplicated tables. It performed much faster, as in 10 times faster overall, and more than 100 times faster in some functions. That did not surprise me, because that was my intention, and the science predicted it, but it surprised the people with the mantra.

More Normalisation

Well, having had success with Normalisation in every project, and confidence with the science involved, it has been a natural progression to Normalise more, not less. In the old days 3NF was good enough, and later NFs were not yet identified. In the last 20 years, I have only delivered databases that had zero update anomalies, so it turns out, by today's definitions of NFs, I have always delivered 5NF.

Likewise, 5NF is great but it has its limitations. Eg. Pivoting large tables (not small result sets as per the MS PIVOT Extension) was slow. So I (and others) developed a way of providing Normalised tables such that Pivoting was (a) easy and (b) very fast. It turns out, now that 6NF has been defined, that those tables are 6NF.

Since I provide OLAP and OLTP from the same database, I have found that, consistent with the science, the more Normalised the structures are:

  • the faster they perform

  • and they can be used in more ways (eg Pivots)

So yes, I have consistent and unvarying experience, that not only is Normalised much, much faster than un-normalised or "de-normalised"; more Normalised is even faster than less normalised.

One sign of success is growth in functionality (the sign of failure is growth in size without growth in functionality). Which meant they immediately asked us for more reporting functionality, which meant we Normalised even more, and provided more of those specialised tables (which turned out years later, to be 6NF).

Progressing on that theme. I was always a Database specialist, not a data warehouse specialist, so my first few projects with warehouses were not full-blown implementations, but rather, they were substantial performance tuning assignments. They were in my ambit, on products that I specialised in.
Typical Data Warehouse
Let's not worry about the exact level of normalisation, etc, because we are looking at the typical case. We can take it as given that the OLTP database was reasonably normalised, but not capable of OLAP, and the organisation had purchased a completely separate OLAP platform, hardware; invested in developing and maintaining masses of ETL code; etc. And following implementation then spent half their life managing the duplicates they had created. Here the book writers and vendors need to be blamed, for the massive waste of hardware and separate platform software licences they cause organisations to purchase.

  • If you have not observed it yet, I would ask you to notice the similarities between the Typical First Generation "database" and the Typical Data Warehouse

Meanwhile, back at the farm (the 5NF Databases above) we just kept adding more and more OLAP functionality. Sure the app functionality grew, but that was little, the business had not changed. They would ask for more 6NF and it was easy to provide (5NF to 6NF is a small step; 0NF to anything, let alone 5NF, is a big step; an organised architecture is easy to extend).

One major difference between OLTP and OLAP, the basic justification of separate OLAP platform software, is that the OLTP is row-oriented, it needs transactionally secure rows, and fast; and the OLAP doesn't care about the transactional issues, it needs columns, and fast. That is the reason all the high end BI or OLAP platforms are column-oriented, and that is why the OLAP models (Star Schema, Dimension-Fact) are column-oriented.

But with the 6NF tables:

  • there are no rows, only columns; we serve up rows and columns at the same blinding speed

  • the tables (ie. the 5NF view of the 6NF structures) are already organised into Dimension-Facts. In fact they are organised into more Dimensions than any OLAP model would ever identify, because they are all Dimensions.

  • Pivoting entire tables with aggregation on the fly (as opposed to the PIVOT of a small number of derived columns) is (a) effortless, simple code and (b) very fast
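
As a rough illustration of the kind of on-the-fly pivot meant here (the table, column and Metric names below are invented for this sketch, not the author's actual model), a 6NF-style table holding one metric value per row can be pivoted into columns with plain conditional aggregation:

-- Hypothetical 6NF-style table: the key plus a single non-key column (Value)
--   MonitorStat ( ServerNo, DateTime, Metric, Value )

SELECT ServerNo,
       DateTime,
       MAX(CASE WHEN Metric = 'CpuBusy'   THEN Value END) AS CpuBusy,
       MAX(CASE WHEN Metric = 'LockWaits' THEN Value END) AS LockWaits,
       MAX(CASE WHEN Metric = 'IOsPerSec' THEN Value END) AS IOsPerSec
  FROM MonitorStat
 GROUP BY ServerNo, DateTime;

Adding a new metric means adding rows, not altering the table; the pivot query simply grows another CASE column.
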
6NF Corporate Database

What we have been supplying for many years, by definition, is Relational Databases with at least 5NF for OLTP use, and 6NF for OLAP requirements.

  • Notice that it is the very same science that we have used from the outset; to move from Typical un-normalised "databases" to 5NF Corporate Database. We are simply applying more of the proven science, and obtaining higher orders of functionality and performance.

  • Notice the similarity between 5NF Corporate Database and 6NF Corporate Database

  • The entire cost of separate OLAP hardware, platform software, ETL, administration and maintenance is eliminated.

  • There is only one version of the data, no update anomalies or maintenance thereof; the same data served up for OLTP as rows, and for OLAP as columns

The only thing we have not done, is to start off on a new project, and declare pure 6NF from the start. That is what I have lined up next.

What is Sixth Normal Form ?

Assuming you have a handle on Normalisation (I am not going to define it here), the non-academic definitions relevant to this thread are as follows. Note that it applies at the table level, hence you can have a mix of 5NF and 6NF tables in the same database:

  • Fifth Normal Form: all Functional Dependencies resolved across the database
    • in addition to 4NF/BCNF
    • every non-PK column is 1::1 with its PK
    • and to no other PK
    • No Update Anomalies
  • Sixth Normal Form: is the irreducible NF, the point at which the data cannot be further reduced or Normalised (there will not be a 7NF)
    • in addition to 5NF
    • the row consists of a Primary Key, and at most, one non-key column
    • eliminates The Null Problem
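
For concreteness, here is a minimal DDL sketch of the difference, with invented names (this is not the author's model): a 5NF table with several non-key columns is decomposed so that each 6NF table carries the key plus at most one attribute, and a 5NF view can reassemble the row.

-- 5NF: one table, several non-key columns
--   Customer ( CustomerId, Name, CreditLimit )

-- 6NF: one table per attribute; each row is the PK plus at most one non-key column
CREATE TABLE CustomerName (
    CustomerId  INTEGER     NOT NULL,
    Name        VARCHAR(60) NOT NULL,
    PRIMARY KEY (CustomerId)
);

CREATE TABLE CustomerCreditLimit (
    CustomerId   INTEGER       NOT NULL,
    CreditLimit  DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (CustomerId)
);

-- A 5NF view over the 6NF tables; a missing attribute simply has no row (no Nulls stored)
CREATE VIEW Customer_5NF AS
SELECT n.CustomerId,
       n.Name,
       c.CreditLimit
  FROM CustomerName n
  LEFT JOIN CustomerCreditLimit c ON c.CustomerId = n.CustomerId;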

What Does 6NF Look Like ?

The Data Models belong to the customers, and our Intellectual Property is not available for free publication. But I do attend this web-site, and provide specific answers to questions. You do need a real world example, so I will publish the Data Model for one of our internal utilities.

This one is for the collection of server monitoring data (enterprise class database server and OS) for any number of customers, for any period. We use this to analyse performance issues remotely, and to verify any performance tuning that we do. The structure has not changed in over ten years (added to, with no change to the existing structures); it is typical of the specialised 5NF that many years later was identified as 6NF. Allows full pivoting; any chart or graph to be drawn, on any Dimension (22 Pivots are provided but that is not a limit); slice and dice; mix and match. Notice they are all Dimensions.

The monitoring data or Metrics or vectors can change (server version changes; we want to pick up something more) without affecting the model (you may recall in another post I stated EAV is the bastard son of 6NF; well this is full 6NF, the undiluted father, and therefore provides all features of EAV, without sacrificing any Standards, integrity or Relational power); you merely add rows.

▶Monitor Statistics Data Model◀. (too large for inline; some browsers cannot load inline; click the link)

It allows me to produce these ▶Charts Like This◀, six keystrokes after receiving a raw monitoring stats file from the customer. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. (Used with permission.)

Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the ▶IDEF1X Notation◀ helpful.

6NF Data Warehouse

This has been recently validated by Anchor Modeling, in that they are now presenting 6NF as the "next generation" OLAP model for data warehouses. (They do not provide the OLTP and OLAP from the single version of the data, that is ours alone).

Data Warehouse (Only) Experience

My experience with Data Warehouses only (not the above 6NF OLTP-OLAP Databases), has been several major assignments, as opposed to full implementation projects. The results were, no surprise:

  • consistent with the science, Normalised structures perform much faster; are easier to maintain; and require less data synching. Inmon, not Kimball.

  • consistent with the magic, after I Normalise a bunch of tables, and deliver substantially improved performance via application of the laws of physics, the only people surprised are the magicians with their mantras.

Scientifically minded people do not do that; they do not believe in, or rely upon, silver bullets and magic; they use science and hard work to resolve their problems.

Valid Data Warehouse Justification

That is why I have stated in other posts, the only valid justification for a separate Data Warehouse platform, hardware, ETL, maintenance, etc, is where there are many Databases or "databases", all being merged into a central warehouse, for reporting and OLAP.

Kimball

A word on Kimball is necessary, as he is the main proponent of "de-normalised for performance" in data warehouses. As per my definitions above, he is one of those people who have evidently never Normalised in their lives; his starting point was un-normalised (camouflaged as "de-normalised") and he simply implemented that in a Dimension-Fact model.

  • Of course, to obtain any performance, he had to "de-normalise" even more, and create further duplicates, and justify all that.

    • So therefore it is true, in a schizophrenic sort of way, that "de-normalising" un-normalised structures, by making more specialised copies, "improves read performance". It is not true when the whole is taken into account; it is true only inside that little asylum, not outside.

    • Likewise it is true, in that crazy way, that where all the "tables" are monsters, "joins are expensive" and something to be avoided. They have never had the experience of joining smaller tables and sets, so they cannot believe the scientific fact that more, smaller tables are faster.

    • they have experience that creating duplicate "tables" is faster, so they cannot believe that eliminating duplicates is even faster than that.

  • his Dimensions are added to the un-normalised data. Well the data is not Normalised, so no Dimensions are exposed. Whereas in a Normalised model, the Dimensions are already exposed, as an integral part of the data, no addition is required.

  • that well-paved path of Kimball's leads to the cliff, where more lemmings fall to their deaths, faster. Lemmings are herd animals, as long as they are walking the path together, and dying together, they die happy. Lemmings do not look for other paths.

All just stories, parts of the one mythology that hang out together and support each other.

Your Mission

Should you choose to accept it. I am asking you to think for yourself, and to stop entertaining any thoughts that contradict science and the laws of physics. No matter how common or mystical or mythological they are. Seek evidence for anything before trusting it. Be scientific, verify new beliefs for yourself. Repeating the mantra "de-normalised for performance" won't make your database faster, it will just make you feel better about it. Like the fat kid sitting on the sidelines telling himself that he can run faster than all the kids in the race.

  • on that basis, even the concept of "normalise for OLTP" but do the opposite, "de-normalise for OLAP", is a contradiction. How can the laws of physics work as stated on one computer, but work in reverse on another computer ? The mind boggles. It is simply not possible; they work the same way on every computer.

Questions ?

淡淡離愁欲言轉身 2024-10-14 05:08:38

Denormalization and aggregation are the two main strategies used to achieve performance in a data warehouse. It's just silly to suggest that it doesn't improve read performance! Surely I must have misunderstood something here?

Aggregation:
Consider a table holding 1 billion purchases.
Contrast it with a table holding one row with the sum of the purchases.
Now, which is faster? Select sum(amount) from the one-billion-row table or a select amount from the one-row-table? It's a stupid example of course, but it illustrates the principle of aggregation quite clearly. Why is it faster? Because regardless of what magical model/hardware/software/religion we use, reading 100 bytes is faster than reading 100 gigabytes. Simple as that.
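
A minimal sketch of that idea, with invented table and column names (how and when the summary is refreshed - nightly batch, trigger, materialized view - is a separate design decision):

-- Base table: one row per purchase (a billion of them)
--   purchases ( purchase_id, customer_id, purchase_date, amount, ... )

-- Pre-aggregated summary, maintained by the load/ETL process
CREATE TABLE purchases_summary (
    total_amount DECIMAL(18,2) NOT NULL
);

INSERT INTO purchases_summary (total_amount)
SELECT SUM(amount) FROM purchases;

-- The report now reads one row instead of scanning a billion
SELECT total_amount FROM purchases_summary;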

Denormalization:
A typical product dimension in a retail data warehouse has shitloads of columns. Some columns are easy stuff like "Name" or "Color", but it also has some complicated stuff, like hierarchies. Multiple hierarchies (the product range (5 levels), the intended buyer (3 levels), raw materials (8 levels), way of production (8 levels)), along with several computed numbers such as average lead time (since start of the year), weight/packaging measures, etcetera etcetera. I've maintained a product dimension table with 200+ columns that was constructed from ~70 tables from 5 different source systems. It is just plain silly to debate whether a query on the normalized model (below)

select product_id
  from table1
  join table2 on(keys)
  join (select average(..)
          from one_billion_row_table 
         where lastyear = ...) on(keys)
  join ...table70
 where function_with_fuzzy_matching(table1.cola, table37.colb) > 0.7
   and exists(select ... from )
   and not exists(select ...)
   and table20.version_id = (select max(v_id) from product_ver where ...)
   and average_price between 10 and 20
   and product_range = 'High-Profile'

...is faster than the equivalent query on the denormalized model:

select product_id
  from product_denormalized
 where average_price between 10 and 20
   and product_range = 'High-Profile';

Why? Partly for the same reason as the aggregated scenario. But also because the queries are just "complicated". They are so disgustingly complicated that the optimizer (and now I'm going into Oracle specifics) gets confused and screws up the execution plans. Suboptimal execution plans may not be such a big deal if the query deals with small amounts of data. But as soon as we start to join in the Big Tables it is crucial that the database gets the execution plan right. Having denormalized the data in one table with a single synthetic key (heck, why don't I add more fuel to this ongoing fire), the filters become simple range/equality filters on pre-cooked columns. Having duplicated the data into new columns enables us to gather statistics on the columns which will help the optimizer in estimating the selectivities and thus providing us with a proper execution plan (well, ...).

Obviously, using denormalization and aggregation makes it harder to accommodate schema changes, which is a bad thing. On the other hand, they provide read performance, which is a good thing.

So, should you denormalize your database in order to achieve read-performance?
Hell no! It adds so many complexities to your system that there is no end to how many ways it will screw you over before you have delivered. Is it worth it? Yes, sometimes you need to do it to meet a specific performance requirement.

Update 1

PerformanceDBA: 1 row would get updated a billion times a day

That would imply a (near) realtime requirement (which in turn would generate a completely different set of technical requirements). Many (if not most) data warehouses do not have that requirement. I picked an unrealistic aggregation example just to make it clear why aggregation works. I didn't want to have to explain rollup strategies too :)

Also, one has to contrast the needs of the typical user of a data warehouse and the typical user of the underlying OLTP system. A user looking to understand what factors drive transport costs couldn't care less if 50% of today's data is missing or if 10 trucks exploded and killed the drivers. Performing the analysis over 2 years' worth of data would still come to the same conclusion even if he had to-the-second up-to-date information at his disposal.

Contrast this to the needs of the drivers of those trucks (the ones who survived). They can't wait 5 hours at some transit point just because some stupid aggregation process has to finish. Having two separate copies of the data solves both needs.

Another major hurdle with sharing the same set of data for operational systems and reporting systems is that the release cycles, QA, deployment, SLAs and what have you are very different. Again, having two separate copies makes this easier to handle.

錯遇了你 2024-10-14 05:08:38

By "OLAP" I understand you to mean a subject-oriented relational / SQL database used for decision support - AKA a Data Warehouse.

Normal Form (typically 5th / 6th Normal Form) is generally the best model for a Data Warehouse. The reasons for normalizing a Data Warehouse are exactly the same as any other database: it reduces redundancy and avoids potential update anomalies; it avoids built-in bias and is therefore the easiest way to support schema change and new requirements. Using Normal Form in a data warehouse also helps keep the data load process simple and consistent.

There is no "traditional" denormalization approach. Good data warehouses have always been normalized.

清欢 2024-10-14 05:08:38

Should not a database be denormalized for reading performance?

Okay, here goes a total "Your Mileage May Vary", "It Depends", "Use The Proper Tool For Every Job", "One Size Does Not Fit All" answer, with a bit of "Don't Fix It If It Ain't Broken" thrown in:

Denormalization is one way to improve query performance in certain situations. In other situations it may actually reduce performance (because of the increased disk use). It certainly makes updates more difficult.

It should only be considered when you hit a performance problem (because you are giving up the benefits of normalization and introducing complexity).

The drawbacks of denormalization are less of an issue with data that is never updated, or only updated in batch jobs, i.e. not OLTP data.

If denormalization solves a performance problem that you need solved, and that less invasive techniques (like indexes or caches or buying a bigger server) do not solve, then yes, you should do it.

冷了相思 2024-10-14 05:08:38

First my opinions, then some analysis

Opinions
Denormalisation is perceived to help reading data because common use of the word denormalisation often includes not only breaking normal forms, but also introducing any insertion, update and deletion dependencies into the system.

This, strictly speaking, is false (see this question/answer). Denormalisation in the strict sense means breaking any of the normal forms from 1NF-6NF; other insertion, update and deletion dependencies are addressed by the Principle of Orthogonal Design.

So what happens is that people take the Space vs Time tradeoff principle and remember the term redundancy (associated with denormalisation, but still not equal to it) and conclude that you should see benefits. This is a faulty implication, but false implications do not allow you to conclude the reverse.

Breaking normal forms may indeed speed up some data retrieval (details in analysis below), but as a rule it will also at the same time:

  • favour only specific types of queries and slow down all other access paths
  • increase complexity of the system (which influences not only maintenance of the database itself, but also increases the complexity of applications that consume the data)
  • obfuscate and weaken semantic clarity of the database
  • the main point of database systems, as the central data representing the problem space, is to be unbiased in recording the facts, so that when requirements change you don't have to redesign the parts of the system (data and applications) that are independent in reality. To be able to do this, artificial dependencies should be minimised - today's 'critical' requirement to speed up one query quite often becomes only marginally important.

Analysis

So, I made a claim that sometimes breaking normal forms can help retrieval. Time to give some arguments

1) Breaking 1NF

Assume you have financial records in 6NF. From such a database you can surely get a report on the balance of each account for each month.

Assuming that a query that would have to calculate such a report would need to go through n records, you could make a table

account_balances(month, report)

which would hold XML structured balances for each account. This breaks 1NF (see notes later), but allows one specific query to execute with minimum I/O.
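
A minimal sketch of such a table, with assumed column types (the report column simply holds the pre-built XML document for that month):

CREATE TABLE account_balances (
    month   DATE NOT NULL,   -- e.g. the first day of the month
    report  TEXT NOT NULL,   -- XML document with the balance of every account
    PRIMARY KEY (month)
);

-- The targeted report becomes a single-row lookup
SELECT report FROM account_balances WHERE month = '2010-06-01';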

At the same time, assuming it is possible to update any month with inserts, updates or deletes of financial records, the performance of the update queries on the system might be slowed down by time proportional to some function of n for each update.
(the above case illustrates a principle; in reality you would have better options, and the benefit of getting minimum I/O brings such penalties that for a realistic system that actually updates data often, you would get bad performance even for your targeted query, depending on the type of actual workload; I can explain this in more detail if you want)

Note:
This is actually a trivial example and there is one problem with it - the definition of 1NF. The assumption that the above model breaks 1NF is based on the requirement that values of an attribute 'contain exactly one value from the applicable domain'.

This allows you to say that the domain of the attribute report is the set of all possible reports, that each row holds exactly one value from that domain, and therefore claim that 1NF is not broken (similar to the argument that storing words does not break 1NF even though you might have a letters relation somewhere in your model).

On the other hand, there are much better ways to model this table, which would be more useful for a wider range of queries (such as retrieving balances for a single account for all months in a year). In this case you would justify that improvement by saying that this field is not in 1NF.

Anyway it explains why people claim that breaking NFs might improve performance.

2) Breaking 3NF

Assuming tables in 3NF

CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `member_id` int(10) unsigned NOT NULL,
  `status` tinyint(3) unsigned NOT NULL,
  `amount` decimal(10,2) NOT NULL,
  `opening` decimal(10,2) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `member_id` (`member_id`),
  CONSTRAINT `t_ibfk_1` FOREIGN KEY (`member_id`) REFERENCES `m` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB

CREATE TABLE `m` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB

with sample data (1M rows in t, 100k in m)

Assume a common query that you want to improve

mysql> select sql_no_cache m.name, count(*) 
       from t join m on t.member_id = m.id 
       where t.id between 100000 and 500000 group by m.name;
+-------+----------+
| name  | count(*) |
+-------+----------+
| omega |       11 |
| test  |        8 |
| test3 |   399982 |
+-------+----------+
3 rows in set (1.08 sec)

you could find suggestions to copy the attribute name from m into table t, which breaks 3NF (it introduces the FD: member_id -> name, and member_id is not a key of t)

after

alter table t add column name varchar(255);
update t inner join m on t.member_id = m.id set t.name = m.name;

running

mysql> select sql_no_cache name, count(*) 
       from t where id 
       between 100000 and 500000 
       group by name;
+-------+----------+
| name  | count(*) |
+-------+----------+
| omega |       11 |
| test  |        8 |
| test3 |   399982 |
+-------+----------+
3 rows in set (0.41 sec)

notes:
The above query execution time is cut in half, but

  • the table was not in 5NF/6NF to begin with
  • the test was done with sql_no_cache so most cache mechanisms were avoided (and in real situations they play a role in the system's performance)
  • space consumption is increased by approx 9x size of the column name x 100k rows
  • there should be triggers on t to keep the integrity of the data, which would significantly slow down all updates to name and add additional checks that inserts in t would need to go through (a sketch of such triggers follows this list)
  • probably better results could be achieved by dropping surrogate keys and switching to natural keys, and/or indexing, or redesigning to higher NFs
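
For illustration, a hypothetical pair of such triggers in MySQL (trigger names invented), keeping the copied t.name consistent with m.name:

-- Fill the copied column when a row is inserted into t
CREATE TRIGGER t_name_fill BEFORE INSERT ON t
FOR EACH ROW
  SET NEW.name = (SELECT name FROM m WHERE id = NEW.member_id);

-- Propagate renames in m to every copy held in t
DELIMITER //
CREATE TRIGGER m_name_sync AFTER UPDATE ON m
FOR EACH ROW
BEGIN
  UPDATE t SET name = NEW.name WHERE member_id = NEW.id;
END//
DELIMITER ;

Every rename in m now has to touch every copy held in t, which is exactly the update penalty mentioned in the list above.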

Normalising is the proper way in the long run. But you don't always have the option to redesign the company's ERP (which, for example, is already only mostly 3NF) - sometimes you must achieve a certain task within given resources. Of course doing this is only a short-term 'solution'.

Bottom line

I think that the most pertinent answer to your question is that you will find the industry and education using the term 'denormalisation' in

  • in a strict sense, for breaking NFs
  • loosely, for introducing any insertion, update and deletion dependencies (Codd's original comments on normalisation speak of 'undesirable(!) insertion, update and deletion dependencies'; see some details here)

So, under the strict definition, aggregations (summary tables) are not considered denormalisation, and they can help a lot in terms of performance (as will any cache, which is not perceived as denormalisation).

The loose usage encompasses both breaking normal forms and the principle of orthogonal design, as said before.

Another thing that might shed some light is that there is a very important difference between the logical model and the physical model.

For example, indexes store redundant data, but no one considers them denormalisation, not even people who use the term loosely, and there are two (connected) reasons for this:

  • they are not part of the logical model
  • they are transparent and guaranteed not to break integrity of your model

If you fail to properly model your logical model you will end up with an inconsistent database - wrong types of relationships between your entities (inability to represent the problem space), conflicting facts (the ability to lose information) - and you should employ whatever methods you can to get a correct logical model; it is the foundation for all applications that will be built on top of it.

Normalisation, orthogonal and clear semantics of your predicates, well defined attributes, correctly identified functional dependencies all play a factor in avoiding pitfalls.

When it comes to physical implementation, things get more relaxed, in the sense that a materialised computed column that depends on a non-key column might break 3NF, but if there are mechanisms that guarantee consistency it is allowed in the physical model, in the same way that indexes are allowed. You have to justify it very carefully, though, because usually normalising will yield the same or better improvements across the board, with no (or fewer) negative effects, and will keep the design clear (which reduces application development and maintenance costs), resulting in savings that you can easily spend on upgrading hardware to improve the speed even more than what is achieved by breaking NFs.

可爱咩 2024-10-14 05:08:38

The two most popular methodologies for building a data warehouse (DW) seem to be Bill Inmon's and Ralph Kimball's.

Inmon's methodology uses a normalized approach, while Kimball's uses dimensional modelling -- a de-normalized star schema.

Both are well documented down to small details and both have many successful implementations. Both present a "wide, well-paved road" to a DW destination.

I can not comment on the 6NF approach nor on Anchor Modelling because I have never seen nor participated in a DW project using that methodology. When it comes to implementations, I like to travel down well tested paths -- but, that's just me.

So, to summarize, should DW be normalized or de-normalized? Depends on the methodology you pick -- simply pick one and stick to it, at least till the end of the project.

EDIT - An Example

At the place I currently work for, we had a legacy report which has been running since forever on the production server. Not a plain report, but a collection of 30 sub-reports emailed to everybody and his aunt every day.

Recently, we implemented a DW. With two report servers and a bunch of reports in place, I was hoping that we could forget about the legacy thing. But no, legacy is legacy, we always had it, so we want it, need it, can't live without it, etc.

The thing is that the mess of a Python script and SQL took eight hours (yes, e-i-g-h-t hours) to run every single day. Needless to say, the database and the application were built over the years by a few batches of developers -- so, not exactly your 5NF.

It was time to re-create the legacy thing from the DW. Ok, to keep it short, it's done and it takes 3 minutes (t-h-r-e-e minutes) to produce it, six seconds per sub-report. And I was in a hurry to deliver, so I was not even optimizing all the queries. This is a factor of 8 * 60 / 3 = 160 times faster -- not to mention the benefit of removing an eight-hour job from a production server. I think I can still shave off a minute or so, but right now no one cares.

As a point of interest, I have used Kimball's method (dimensional modelling) for the DW and everything used in this story is open-source.

This is what all this (data-warehouse) is supposed to be about, I think. Does it even matter which methodology (normalized or de-normalized) was used?

EDIT 2

As a point of interest, Bill Inmon has a nicely written paper on his website -- A Tale of Two Architectures.

在巴黎塔顶看东京樱花 2024-10-14 05:08:38

The problem with the word "denormalized" is that it doesn't specify what direction to go in. It's about like trying to get to San Francisco from Chicago by driving away from New York.

A star schema or a snowflake schema is certainly not normalized. And it certainly performs better than a normalized schema in certain usage patterns. But there are cases of denormalization where the designer wasn't following any discipline at all, but just composing tables by intuition. Sometimes those efforts don't pan out.

In short, don't just denormalize. Do follow a different design discipline if you are confident of its benefits, and even if it doesn't agree with normalized design. But don't use denormalization as an excuse for haphazard design.

谁的新欢旧爱 2024-10-14 05:08:38

The short answer is don't fix a performance problem you have not got!

As for time-based tables, the generally accepted paradigm is to have valid_from and valid_to dates in every row. This is still basically 3NF, as it only changes the semantics from "this is the one and only version of this entity" to "this is the one and only version of this entity at this time".
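
A minimal sketch of that pattern (table and column names other than valid_from/valid_to are illustrative):

CREATE TABLE customer_address (
    customer_id  INT          NOT NULL,
    valid_from   DATE         NOT NULL,
    valid_to     DATE         NOT NULL,   -- e.g. '9999-12-31' for the current version
    address      VARCHAR(255) NOT NULL,
    PRIMARY KEY (customer_id, valid_from)
);

-- "The one and only version of this entity at this time"
SELECT address
  FROM customer_address
 WHERE customer_id = 42
   AND CURRENT_DATE BETWEEN valid_from AND valid_to;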

东京女 2024-10-14 05:08:38

Simplification:

An OLTP database should be normalised (as far as makes sense).

An OLAP data warehouse should be denormalised into Fact and Dimension tables (to minimise joins).
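
For illustration, a minimal Fact/Dimension sketch with hypothetical names; the dimension deliberately repeats descriptive attributes so that a report needs only one join per dimension:

CREATE TABLE dim_product (
    product_key  INT          NOT NULL PRIMARY KEY,
    name         VARCHAR(100) NOT NULL,
    category     VARCHAR(50)  NOT NULL,   -- denormalised: repeated for every product
    brand        VARCHAR(50)  NOT NULL
);

CREATE TABLE fact_sales (
    date_key     INT           NOT NULL,
    product_key  INT           NOT NULL REFERENCES dim_product (product_key),
    quantity     INT           NOT NULL,
    amount       DECIMAL(12,2) NOT NULL
);

-- A typical report: join to the dimension, then aggregate
SELECT p.category, SUM(f.amount) AS sales
  FROM fact_sales f
  JOIN dim_product p ON p.product_key = f.product_key
 GROUP BY p.category;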
