与可空列或单独的表相比,使用键/值表有哪些优势?

发布于 2024-09-29 10:10:34 字数 757 浏览 9 评论 0原文

我正在升级我不久前创建的支付管理系统。目前,它针对其可接受的每种付款类型都有一个表。它仅限于只能支付一件事情,这次升级就是为了缓解这一问题。我一直在寻求关于如何设计它的建议,我有以下基本想法:

  1. 为每种付款类型准备一张表,每种付款类型都有一些公共列。 (当前设计)
  2. 使用一个中央表来协调所有付款,该表采用公共列(统一付款 ID,无论类型如何),并标识另一个表和行 ID,该表和行 ID 具有专用于该付款类型的列。
  3. 为所有付款类型提供一张表,并将未用于任何给定类型的列清空。
  4. 使用中心表的想法,但将专用列存储在键/值表中。

我的目标是:速度不要慢得离谱,尽可能多地自我记录,在保持其他目标的同时最大限度地提高灵活性。

我不太喜欢 1,因为每个表中都有重复的列。它反映了支付类型类继承了为所有支付类型提供功能的基类...相反的ORM?

我最倾向于 2,因为它与当前设计一样“类型安全”和自我记录。但是,与 1 一样,要添加新的付款类型,我需要添加一个新表。

我不喜欢 3,因为它“浪费空间”,并且不能立即清楚哪些列用于哪些付款类型。文档可以在一定程度上减轻这种痛苦,但是我公司的内部工具没有有效的方法来存储/查找技术文档。

我对 4 给出的论点是,它可以减轻添加新支付方式时更改数据库的需要,但由于缺乏明确性,它比 3 更糟糕。目前,更改数据库不是问题,但如果我们决定开始让客户保留自己的数据库,这可能会成为后勤噩梦。

所以,我当然有我的偏见。有人有更好的想法吗?您认为哪种设计最适合?我应该根据什么标准做出决定?

I'm upgrading a payment management system I created a while ago. It currently has one table for each payment type it can accept. It is limited to only being able to pay for one thing, which this upgrade is to alleviate. I've been asking for suggestions as to how I should design it, and I have these basic ideas to work from:

  1. Have one table for each payment type, with a few common columns on each. (current design)
  2. Coordinate all payments with a central table that takes on the common columns (unifying payment IDs regardless of type), and identifies another table and row ID that has columns specialized to that payment type.
  3. Have one table for all payment types, and null the columns which are not used for any given type.
  4. Use the central table idea, but store specialized columns in a key/value table.

My goals for this are: not ridiculously slow, self-documenting as much as possible, and maximizing flexibility while maintaining the other goals.

I don't like 1 very much because of the duplicate columns in each table. It reflects the payment type classes inheriting a base class that provides functionality for all payment types... ORM in reverse?

I'm leaning toward 2 the most, because it's just as "type safe" and self-documenting as the current design. But, as with 1, to add a new payment type, I need to add a new table.

I don't like 3 because of its "wasted space", and it's not immediately clear which columns are used for which payment types. Documentation can alleviate the pain of this somewhat, but my company's internal tools do not have an effective method for storing/finding technical documentation.

The argument I was given for 4 was that it would alleviate needing to change the database when adding a new payment method, but it suffers even worse than 3 does from the lack of explicitness. Currently, changing the database isn't a problem, but it could become a logistical nightmare if we decide to start letting customers keep their own database down the road.

So, of course I have my biases. Does anyone have any better ideas? Which design do you think fits best? What criteria should I base my decision on?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

浴红衣 2024-10-06 10:10:34

注意
这个主题正在讨论中,并且这个帖子正在其他帖子中被引用,因此我已经对其进行了合理的处理,请耐心等待。我的目的是提供理解,以便您能够做出明智的决定,而不是仅仅基于标签做出简单化的决定。如果你觉得它很激烈,请在闲暇时分块阅读;当你饿了的时候才回来,而不是之前。

EAV 到底什么是“坏”?

1 简介

EAV 和 EAV 之间有区别(实体-属性-值模型)做得好和做得不好,就像 3NF 做得好和做得不好之间有区别一样。在我们的技术工作中,我们需要准确地了解什么有效,什么无效;关于什么表现良好,什么表现不佳。一揽子声明是危险的,会误导人们,从而阻碍进展和对有关问题的普遍理解。

我不支持也不反对任何事情,除了非熟练工人实施不当以及歪曲标准合规程度之外。当我看到误解时,就像这里一样,我将尝试解决它。

规范化也经常被误解,所以对此略说一下。维基百科和其他免费资源实际上发布了完全无意义的“定义”,没有学术基础,有供应商偏见,以验证其不符合标准的产品。有一个科德发表了他的十二条规则。我至少实现了 5NF,这足以满足大多数要求,因此我将使用它作为基准。简单地说,假设读者理解第三范式(至少该定义不会混淆)...

2 第五范式

2.1 定义

第五范式定义为:

  • 在表中或任何其他表中,每列仅与主键存在 1::1 关系,
  • 与其他列不存在 1::
  • 1 关系,结果在任何地方都没有重复的列;无更新异常(不需要触发器或复杂的代码来确保更新列时,其重复项得到正确更新)。
  • 它提高了性能,因为(a)它影响更少的行,并且(b)由于减少锁定而提高了并发性。

我区分的是,数据库是否标准化为特定的 NF 不是问题;数据库只是标准化的。 每个表都被归一化为特定的 NF:某些表可能只需要 1NF,其他表可能只需要 3NF,还有一些表需要 5NF。

2.2 性能

曾经有一段时间,人们认为规范化不能提供性能,他们不得不“为了性能而去规范化”。感谢上帝,这个神话已经被揭穿,今天大多数 IT 专业人员都意识到规范化数据库的性能更好。数据库供应商针对规范化数据库进行优化,而不是针对非规范化文件系统。 “非规范化”的真相是,数据库一开始就没有规范化(并且性能很差),它是非规范化的,并且他们做了一些进一步的加扰来提高性能。为了非规范化,必须首先忠实地规范化,但这从未发生过。我重写了许多这样的“非规范化性能”数据库,仅提供忠实的规范化,并且它们的运行速度至少快十倍,最多快一百倍。此外,它们只需要一小部分磁盘空间。它是如此平淡无奇,所以我保证以书面形式进行练习。

2.3 限制

限制,或者更确切地说,5NF 的全部范围是:

  • 它不处理可选值,并且必须使用 Null(许多设计者不允许 Null 并使用替代品,但是如果不处理的话,这就会有限制) ,但
  • 您仍然需要更改 DDL 来添加或更改列(并且在实施后,添加最初未识别的列的要求越来越多;更改控制很繁重)
  • 尽管提供了最高水平的性能 由于规范化(即:消除重复和混乱的关系),复杂的查询,例如数据透视(生成行报告或行摘要,以列表示)和数据仓库操作所需的“列访问”,是很困难的,仅这些操作表现不佳。这并不是说这仅仅是由于可用的 SQL 技能水平,而不是引擎。

3 第六范式

3.1 定义

第六范式定义为:

  • 关系(行)是主键加上最多一个属性(列),

称为< em>不可约范式,最终的 NF,因为没有可以执行的进一步规范化。尽管它在九十年代中期就在学术界进行了讨论,但直到 2003 年才被正式宣布。对于那些喜欢通过混淆关系、相关变量、“关系”等来贬低关系模型的形式的人来说:所有这些废话都可以之所以被搁置,是因为从形式上来说,上述定义标识了不可约关系,有时称为原子关系。

3.2 进展

6NF 提供的增量(5NF 没有)是:

  • 对可选值的正式支持,从而消除空问题
    • 副作用是,无需更改 DDL 即可添加列(稍后详细介绍)
  • 轻松旋转
  • 简单直接的列式访问
    • 它允许(不是以其普通形式)该部门的更高水平的绩效

让我说,我(和其他人)在 20 年前就提供了增强的 5NF 表,明确用于旋转,没有任何问题,因此允许(a) 使用简单的 SQL 以及 (b) 提供非常高的性能;很高兴知道业界的学术巨头已经正式定义了我们正在做的事情。一夜之间,我的 5NF 表被重新命名为 6NF,我毫不费力。其次,我们只在需要的地方这样做;同样,规范化为 6NF 的是表,而不是数据库。

3.3 SQL 限制

这是一种麻烦的语言,尤其是重新连接,并且做任何适度复杂的事情都会使它非常麻烦。 (这是一个单独的问题,大多数程序员不理解或不使用子查询。)它支持 5NF 所需的结构,但仅此而已。为了实现稳健且稳定的实现,必须实现附加标准,其中可能部分包含附加目录表。 SQL 的“使用期限”在 20 世纪 90 年代初就已经过去了。它完全不支持 6NF 表,并且迫切需要更换。但这就是我们所拥有的一切,所以我们需要处理它

对于我们这些一直在实施标准和附加目录表的人来说,扩展我们的目录以提供支持 6NF 结构到标准所需的功能并不是一项认真的工作:哪些列属于哪些表,以及按什么顺序;强制/可选;显示格式;本质上是一个完整的元数据目录,与 SQL 目录结合在一起。

请注意,每个 NF 都包含其中之前的每个 NF,因此 6NF 包含 5NF。我们并没有为了提供 6NF 而打破 5NF,我们提供了 5NF 的进展;对于 SQL 不足的地方,我们提供了目录。这意味着基本约束,例如外键;通过 SQL 声明性引用完整性提供的值域;数据类型;检查; 5NF 级别的规则保持不变,这些约束没有被破坏。符合标准的5NF数据库的高质量和高性能并没有因为引入6NF而降低。

3.4 目录

保护用户(任何报告工具)和开发人员免于处理从 5NF 到 6NF 的跳跃(他们的工作是应用程序编码极客,这是我的工作),这一点很重要工作是成为数据库极客)。即使在 5NF,这始终是我的设计目标:一个具有最小数据目录的正确规范化数据库实际上非常易于使用,我绝不会放弃它。请记住,由于正常的维护和扩展,6NF 结构会随着时间的推移而发生变化,数据库的新版本会定期发布。毫无疑问,从 6NF 表构造 5NF 行所需的 SQL(在 5NF 中已经很麻烦)甚至更加麻烦。值得庆幸的是,这是完全没有必要的。

由于我们已经有了目录,它标识了 SQL 未提供的完整 6NF-DDL,如果您愿意的话,我编写了一个小实用程序来读取目录并:

  • 生成 6NF 表 DDL。
  • 生成 6NF 表的 5NF 视图。这使得用户可以幸福地不知道它们,并为他们提供与 5NF 相同的功能和性能,
  • 生成针对 6NF 结构进行操作所需的完整 SQL(不是模板,我们单独有这些),然后编码人员可以使用这些 SQL 。他们从原本需要的单调和重复中解脱出来,可以自由地专注于应用程序逻辑。

我没有为 Pivoting 编写实用程序,因为 5NF 中存在的复杂性已被消除,并且它们现在编写起来非常简单,就像 5NF-enhanced-for-pivoting 一样。此外,大多数报告工具都提供旋转功能,因此我只需要提供包含大量统计数据的功能,这些功能需要在发送到客户端之前在服务器上执行。

3.5 表现

每个人都有自己的十字架;我碰巧对性能很着迷。我的 5NF 数据库表现良好,因此我向您保证,在将任何内容投入生产之前,我运行的基准测试数量远多于必要的数量。 6NF 数据库的性能与 5NF 数据库完全相同,没有更好,也没有更差。这并不奇怪,因为“复杂”6NF SQL 所做的唯一事情(5NF SQL 没有做)是执行更多的联接和子查询。

您必须检查这些神话。

  • 任何对这个问题进行基准测试的人(即检查过查询的执行计划)会知道Joins Cost Nothing,这是一个编译时分辨率,它们在执行时没有影响,
  • 当然,连接的表的数量;正在连接的表;是否可以使用索引;正在连接的键的分布等等,
  • 但连接本身没有任何成本
  • 。在同一个数据库中的十个(较小的)表上,如果它是规范化的,那么四个或九个连接都不会产生任何费用;它们不会影响性能问题;每个连接上的选定集都会影响其中

。 strong>3.6 优点

  1. 不受限制的柱状访问,这就是 6NF 真正脱颖而出的地方。直列式访问速度非常快,无需将数据导出到数据仓库即可从专门的 DW 结构中获得速度。

    我对一些 DW 的研究(绝不完整)表明,它们始终按列存储数据,而不是按行存储数据,这正是 6NF 所做的。我是保守派,所以我不打算做出任何声明 6NF 将取代 DW,但就我而言,它消除了对 6NF 的需求。

  2. 比较 6NF 中可用的功能和 5NF 中不可用的功能(例如旋转)是不公平的,后者显然运行得更快。

这是我们第一个真正的 6NF 数据库(具有完整的目录等;而不是始终只在必要时进行增强的 5NF;后来证明是 6NF),客户非常满意。当然,我在交付后监视了一段时间的性能,并为我的下一个 6NF 项目确定了一种更快的列式访问方法。当我这样做时,可能会给 DW 市场带来一些竞争。客户还没有准备好,我们不会修复未损坏的部分。

3.7 6NF 到底什么是“不好的”?

请注意,并不是每个人都会接近这项工作具有同样的形式、结构和遵守标准。因此,从我们的项目中得出所有 6NF 数据库都性能良好且易于维护的结论是愚蠢的。 (通过查看其他人的实现)得出所有 6NF 数据库性能都很差、难以维护的结论也是愚蠢的;灾难。与往常一样,任何技术努力所产生的性能和易于维护性都严格取决于形式、结构和对标准的遵守,以及相关的技能。

4 实体属性值

披露:经验。我检查了其中一些,主要是医院和医疗系统。我对其中两个进行了纠正作业。海外提供商的初始交付虽然不是很好,但相当充足,但本地提供商实施的扩展却一团糟。但与人们在该网站上发布的关于 re EAV 的灾难相比,还差得远。几个月的紧张工作很好地修复了它们。

4.1 它是什么

对我来说很明显,我所从事的 EAV 实现只是第六范式的子集。实施 EAV 的人这样做是因为他们想要 6NF 的一些功能(例如,无需更改 DDL 即可添加列的能力),但他们不具备实施真正 6NF 的学术知识或标准以及安全地实施和管理它的结构。即使最初的提供商也不知道 6NF,也不知道 EAV 是 6NF 的子集,但当我向他们指出这一点时,他们欣然同意。由于高效且有效地提供 EAV 以及实际上 6NF 所需的结构(目录、视图、自动代码生成)并未在 EAV 社区中正式确定,并且在大多数实现中都缺失,因此我将 EAV 归类为“私生子”第六范式。

4.2 关于 EAV,到底什么是“坏”?

根据本主题和其他主题中的评论,是的,EAV 做得不好就是一场灾难。更重要的是(a)它们太糟糕了,以至于失去了 5NF(忘记 6NF)提供的性能,并且(b)尚未实现与复杂性的普通隔离(编码人员和用户“被迫”使用繁琐的导航)。如果他们没有实施目录,就无法避免各种可预防的错误。

对于糟糕的(EAV 或其他)实现来说,所有这些很可能都是正确的,但它与 6NF 或 EAV 无关。我工作的两个项目都有相当足够的性能(当然,它可以改进;但是没有由于 EAV 导致性能不佳),并且很好地隔离了复杂性。当然,它们的质量或性能远不及我的 5NF 数据库或真正的 6NF 数据库,但考虑到 EAV 社区内对已发布问题的理解水平,它们已经足够公平了。它们不是这些页面中所谓的 EAV 的灾难和不合标准的废话。

5 Nulls

有一个众所周知且有记录的问题,称为“Null Problem”。它本身就值得一篇文章。对于这篇文章,我只想说:

  • 问题实际上是可选值或缺失值;这里要考虑的是表设计,这样没有 Null 与 Nullable 列
  • 实际上并不重要,因为无论您是否使用 Nulls/No Nulls/6NF 来排除缺失值,您都必须为此编写代码,那么问题就在于处理缺失值,这是无法规避的
    • 当然,纯 6NF 除外,它消除了空问题
    • 处理缺失值的编码仍然存在
      • 除了自动生成 SQL 代码之外,呵呵

  • 空值对于性能来说是个坏消息,我们中的许多人几十年前就决定不允许数据库中存在空值(传递参数和结果集中的空值,以指示缺失值,没问题)
    • 这意味着一组空替换和布尔列来指示缺失值
  • Null 导致原本固定的 len 列变为可变 len;变量 len 列不应该在索引中使用,因为在遍历或潜水期间,必须对每个索引条目的每次访问执行一些“解包”。

6 立场

我不是 EAV 或 6NF 的支持者,我是质量和标准的支持者。我的立场是:

  1. 始终以各种方式按照您意识到的最高标准做您正在做的事情。

  2. 对于关系数据库来说,规范化到第三范式是最小的(对我来说是 5NF)。数据类型、声明性引用完整性、事务、规范化都是数据库的基本要求;如果缺少,则它不是数据库。

    • 如果你必须“为了性能而进行非规范化”,那么你就犯了严重的规范化错误,你的设计没有规范化。时期。不要“去规范化”,相反,学习Normalization和Normalise。
  3. 无需做额外的工作。如果 5NF 可以满足您的要求,则无需实施更多。如果您需要可选值或能够在不更改 DDL 的情况下添加列或完全消除 Null 问题,请仅在需要它们的表中实施 6NF。

  4. 如果您这样做,仅由于 SQL 没有为 6NF 提供适当的支持,您将需要实现:

    • 简单而有效的目录(列混淆和数据完整性丢失是根本不可接受的)
    • 通过视图对 6NF 表进行 5NF 访问,以将用户(和开发人员)与受阻碍(非“复杂”)的 SQL 隔离开
    • 编写或购买实用程序,以便您可以生成繁琐的 SQL 来从 6NF 表构造 5NF 行,并避免编写相同的内容
    • 测量、监控、诊断和改进。如果您遇到性能问题,则说明您犯了 (a) 归一化错误或 (b) 编码错误。时期。退一步并修复它。
  5. 如果您决定使用 EAV,请认识到它是什么,6NF,并正确实施它,如上所述。如果你这样做了,那么你将拥有一个成功的项目,这是有保证的。如果你不这样做,你保证会吃到狗的早餐。

6.1 天下没有免费的午餐

这句格言曾被提及,但实际上它被滥用了。它实际、深入应用的方式如上所述:如果您想要 6NF/EAV 的好处,您最好也愿意做获得它所需的工作(目录、标准)。当然,推论是,如果你不做工作,你就不会得到好处。数据类型不会“丢失”;值域;外键;检查;规则。关于性能,6NF/EAV 没有性能损失,但不合标准的工作总是会出现相当大的性能损失。

7个具体问题

最后。适当考虑上述背景,并且这是一个小团队的小项目,毫无疑问:

  • 不要使用 EAV(或 6NF)

  • 不要使用 Null 或 Nullable 列(除非您希望破坏性能)

  • 请对公共付款列使用单个付款表

  • ,对每个付款类型使用一个子表,每个表都有其特定的列

  • 全部完全类型转换和约束。

  • 这个“另一个 row_id”是什么业务?为什么有些人在所有会动的东西上贴上身份证,而不检查它是鹿还是鹰? 否。该孩子是受抚养子女。关系为 1::1。子级的 PK 是父级的 PK,即公共支付表。这是一个普通的Supertype-Subtype集群,Differentiator是PaymentTypeCode。子类型和超类型是关系模型的普通部分,并且在数据库以及任何优秀的建模工具中都得到了充分的满足。

    当然,那些对关系数据库一无所知的人会认为他们在 30 年后发明了它,并给它起了有趣的新名字。或者更糟糕的是,他们故意重新标记它并声称它是他们自己的。直到一些可怜的草皮,受过一点教育和职业自豪感,揭露了无知或欺诈。我不知道它是哪一个,但它是其中之一;我只是陈述事实,很容易证实。

A.对评论的回复

A.1 归属

  1. 我没有个人或私人或特殊的定义。关于以下定义(例如祈使句)的所有声明:
    • 标准化,
    • 普通形式,以及
    • 关系模型。
      .
      请参阅 EF Codd 和 CJ Date 的许多原始文本(无法在网络上免费获得)
      .
      最新的内容是 CJ Date、Hugh Darwen、Nikos A Lorentzos 的时态数据和关系模型
      .
      除了那些文字什么也没有
      .
      “我站在巨人的肩膀上”
      .
  2. 上述内容的本质、主体以及所有关于实现的陈述(例如主观、第一人称)均基于经验;作为商业组织(受薪顾问或运营咨询公司),在美国和澳大利亚的大型金融机构中实施上述原则和概念已超过 32 年。
    • 这包括纠正或替换不合标准或非关系实现的大量大型作业。
      .
  3. 零问题与第六范式
    与标题相关的免费白皮书(它并没有单独定义空问题)可以在以下位置找到:
    http://www.dcs.warwick。 ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf
    .
    6NF 的“简而言之”定义(对于那些有其他 NF 经验的人来说有意义)可以在第 6 页

A.2 支持证据

  1. 如开头所述,这篇文章的目的是反击这个社区中普遍存在的错误信息,作为对社区的服务。

  2. 如果确定了具体声明,则可以提供支持上述原则实施的声明的证据;同样,其他人发布的不正确的言论(本文是对此的回应)同样得到了证明。如果要打包子大战,我们要确保比赛场地是公平的

  3. 这里有一些我可以立即使用的文档。

    a. 大型银行
    这是最好的例子,因为它是出于本文中明确的原因而进行的,并且目标已经实现。他们有 Sybase IQ(DW 产品)的预算,但当我们完成项目时报告速度太快,他们不需要它。交易分析统计数据是我的 5NF 加上旋转扩展,结果是 6NF,如上所述。我认为评论中提出的所有问题都已在文档中得到解答,除了:
    - 行数:
    - 旧数据库未知,但可以从其他统计数据推断
    - 新数据库 = 20 个超过 100M 的表,4 个超过 10B 的表。

    b. 小型金融机构 A 部分
    B 部分 -肉
    C 部分 - 引用图表
    D 部分 - 附录, 之前/之后的索引审核(每个索引 1 行)
    注意四个文档;第四个仅适用于那些希望检查详细指数变化的人。他们正在运行一个无法更改的第 3 方应用程序,因为本地供应商倒闭了,另外还有 120% 的扩展,他们可以但不想更改。我们被叫来是因为他们升级到了 Sybase 的新版本,速度更快,改变了各种性能阈值,从而导致了大量的死锁。在这里,我们对服务器中的所有内容除了数据库模型进行了绝对规范化,目标是(事先保证)消除死锁(抱歉,我不会在这里解释这一点:那些争论“非规范化”的人问题,关于这一点将采用粉红色)。它包括“将表拆分为存档数据库以提高性能”的逆转,这是另一篇文章的主题(是的,新的单个表比两个溢出的表执行得更快)。此练习也适用于 MS SQL Server [插入重写版本]。

    c. 耶鲁纽黑文医院
    那是耶鲁大学医学院,他们的教学医院。这是 Sybase 之上的第三方应用程序。统计数据的问题是,80% 的时间他们仅在指定的测试时间收集快照,但没有一致的历史记录,因此没有“之前的图像”来与我们新的一致统计数据进行比较。据我所知,没有任何其他公司能够以自动方式在同一图表上获取 Unix 和 Sybase 内部统计数据。现在网络就是门槛(这是一件好事)。

Note
This subject is being discussed, and this thread is being referenced in other threads, therefore I have given it a reasonable treatment, please bear with me. My intention is to provide understanding, so that you can make informed decisions, rather than simplistic ones based merely on labels. If you find it intense, read it in chunks, at your leisure; come back when you are hungry, and not before.

What, exactly, about EAV, is "Bad" ?

1 Introduction

There is a difference between EAV (Entity-Attribute-Value Model) done properly, and done badly, just as there is a difference between 3NF done properly and done badly. In our technical work, we need to be precise about exactly what works, and what does not; about what performs well, and what doesn't. Blanket statements are dangerous, misinform people, and thus hinder progress and universal understanding of the issues concerned.

I am not for or against anything, except poor implementations by unskilled workers, and misrepresenting the level of compliance to standards. And where I see misunderstanding, as here, I will attempt to address it.

Normalisation is also often misunderstood, so a word on that. Wikipedia and other free sources actually post completely nonsensical "definitions", that have no academic basis, that have vendor biases so as to validate their non-standard-compliant products. There is a Codd published his Twelve Rules. I implement a minimum of 5NF, which is more than enough for most requirements, so I will use that as a baseline. Simply put, assuming Third Normal Form is understood by the reader (at least that definition is not confused) ...

2 Fifth Normal Form

2.1 Definition

Fifth Normal Form is defined as:

  • every column has a 1::1 relation with the Primary Key, only
  • and to no other column, in the table, or in any other table
  • the result is no duplicated columns, anywhere; No Update Anomalies (no need for triggers or complex code to ensure that, when a column is updated, its duplicates are updated correctly).
  • it improves performance because (a) it affects less rows and (b) improves concurrency due to reduced locking

I make the distinction that, it is not that a database is Normalised to a particular NF or not; the database is simply Normalised. It is that each table is Normalised to a particular NF: some tables may only require 1NF, others 3NF, and yet others require 5NF.

2.2 Performance

There was a time when people thought that Normalisation did not provide performance, and they had to "denormalise for performance". Thank God that myth has been debunked, and most IT professionals today realise that Normalised databases perform better. The database vendors optimise for Normalised databases, not for denormalised file systems. The truth "denormalised" is, the database was NOT normalised in the first place (and it performed badly), it was unnormalised, and they did some further scrambling to improve performance. In order to be Denormalised, it has to be faithfully Normalised first, and that never took place. I have rewritten scores of such "denormalised for performance" databases, providing faithful Normalisation and nothing else, and they ran at least ten, and as much as a hundred times faster. In addition, they required only a fraction of the disk space. It is so pedestrian that I guarantee the exercise, in writing.

2.3 Limitation

The limitations, or rather the full extent of 5NF is:

  • it does not handle optional values, and Nulls have to be used (many designers disallow Nulls and use substitutes, but this has limitations if it not implemented properly and consistently)
  • you still need to change DDL in order to add or change columns (and there are more and more requirements to add columns that were not initially identified, after implementation; change control is onerous)
  • although providing the highest level of performance due to Normalisation (read: elimination of duplicates and confused relations), complex queries such as pivoting (producing a report of rows, or summaries of rows, expressed as columns) and "columnar access" as required for data warehouse operations, are difficult, and those operations only, do not perform well. Not that this is due only to the SQL skill level available, and not to the engine.

3 Sixth Normal Form

3.1 Definition

Sixth Normal Form is defined as:

  • the Relation (row) is the Primary Key plus at most one attribute (column)

It is known as the Irreducible Normal Form, the ultimate NF, because there is no further Normalisation that can be performed. Although it was discussed in academic circles in the mid nineties, it was formally declared only in 2003. For those who like denigrating the formality of the Relational Model, by confusing relations, relvars, "relationships", and the like: all that nonsense can be put to bed because formally, the above definition identifies the Irreducible Relation, sometimes called the Atomic Relation.

3.2 Progression

The increment that 6NF provides (that 5NF does not) is:

  • formal support for optional values, and thus, elimination of The Null Problem
    • a side effect is, columns can be added without DDL changes (more later)
  • effortless pivoting
  • simple and direct columnar access
    • it allows for (not in its vanilla form) an even greater level of performance in this department

Let me say that I (and others) were supplying enhanced 5NF tables 20 years ago, explicitly for pivoting, with no problem at all, and thus allowing (a) simple SQL to be used and (b) providing very high performance; it was nice to know that the academic giants of the industry had formally defined what we were doing. Overnight, my 5NF tables were renamed 6NF, without me lifting a finger. Second, we only did this where we needed it; again, it was the table, not the database, that was Normalised to 6NF.

3.3 SQL Limitation

It is a cumbersome language, particularly re joins, and doing anything moderately complex makes it very cumbersome. (It is a separate issue that most coders do not understand or use subqueries.) It supports the structures required for 5NF, but only just. For robust and stable implementations, one must implement additional standards, which may consist in part, of additional catalogue tables. The "use by" date for SQL had well and truly elapsed by the early nineties; it is totally devoid of any support for 6NF tables, and desperately in need of replacement. But that is all we have, so we need to just Deal With It.

For those of us who had been implementing standards and additional catalogue tables, it was not a serious effort to extend our catalogues to provide the capability required to support 6NF structures to standard: which columns belong to which tables, and in what order; mandatory/optional; display format; etc. Essentially a full MetaData catalogue, married to the SQL catalogue.

Note that each NF contains each previous NF within it, so 6NF contains 5NF. We did not break 5NF in order provide 6NF, we provided a progression from 5NF; and where SQL fell short we provided the catalogue. What this means is, basic constraints such as for Foreign Keys; and Value Domains which were provided via SQL Declarative Referential integrity; Datatypes; CHECKS; and RULES, at the 5NF level, remained intact, and these constraints were not subverted. The high quality and high performance of standard-compliant 5NF databases was not reduced in anyway by introducing 6NF.

3.4 Catalogue

It is important to shield the users (any report tool) and the developers, from having to deal with the jump from 5NF to 6NF (it is their job to be app coding geeks, it is my job to be the database geek). Even at 5NF, that was always a design goal for me: a properly Normalised database, with a minimal Data Directory, is in fact quite easy to use, and there was no way I was going to give that up. Keep in mind that due to normal maintenance and expansion, the 6NF structures change over time, new versions of the database are published at regular intervals. Without doubt, the SQL (already cumbersome at 5NF) required to construct a 5NF row from the 6NF tables, is even more cumbersome. Gratefully, that is completely unnecessary.

Since we already had our catalogue, which identified the full 6NF-DDL-that-SQL-does-not-provide, if you will, I wrote a small utility to read the catalogue and:

  • generate the 6NF table DDL.
  • generate 5NF VIEWS of the 6NF tables. This allowed the users to remain blissfully unaware of them, and gave them the same capability and performance as they had at 5NF
  • generate the full SQL (not a template, we have those separately) required to operate against the 6NF structures, which coders then use. They are released from the tedium and repetition which is otherwise demanded, and free to concentrate on the app logic.

I did not write an utility for Pivoting because the complexity present at 5NF is eliminated, and they are now dead simple to write, as with the 5NF-enhanced-for-pivoting. Besides, most report tools provide pivoting, so I only need to provide functions which comprise heavy churning of stats, which needs to be performed on the server before shipment to the client.

3.5 Performance

Everyone has their cross to bear; I happen to be obsessed with Performance. My 5NF databases performed well, so let me assure you that I ran far more benchmarks than were necessary, before placing anything in production. The 6NF database performed exactly the same as the 5NF database, no better, no worse. This is no surprise, because the only thing the 'complex" 6NF SQL does, that the 5NF SQL doesn't, is perform much more joins and subqueries.

You have to examine the myths.

  • Anyone who has benchmarked the issue (i.e examined the execution plans of queries) will know that Joins Cost Nothing, it is a compile-time resolution, they have no effect at execution time.
  • Yes, of course, the number of tables joined; the size of the tables being joined; whether indices can be used; the distribution of the keys being joined; etc, all cost something.
  • But the join itself costs nothing.
  • A query on five (larger) tables in a Unnormalised database is much slower than the equivalent query on ten (smaller) tables in the same database if it were Normalised. the point is, neither the four nor the nine Joins cost anything; they do not figure in the performance problem; the selected set on each Join does figure in it.

3.6 Benefit

  1. Unrestricted columnar access. This is where 6NF really stands out. The straight columnar access was so fast that there was no need to export the data to a data warehouse in order to obtain speed from specialised DW structures.

    My research into a few DWs, by no means complete, shows that they consistently store data by columns, as opposed to rows, which is exactly what 6NF does. I am conservative, so I am not about to make any declarations that 6NF will displace DWs, but in my case it eliminated the need for one.

  2. It would not be fair to compare functions available in 6NF that were unavailable in 5NF (eg. Pivoting), which obviously ran much faster.

That was our first true 6NF database (with a full catalogue, etc; as opposed to the always 5NF with enhancements only as necessary; which later turned out to be 6NF), and the customer is very happy. Of course I was monitoring performance for some time after delivery, and I identified an even faster columnar access method for my next 6NF project. That, when I do it, might present a bit of competition for the DW market. The customer is not ready, and we do not fix that which is not broken.

3.7 What, Exactly, about 6NF, is "Bad" ?

Note that not everyone would approach the job with as much formality, structure, and adherence to standards. So it would be silly to conclude from our project, that all 6NF databases perform well, and are easy to maintain. It would be just as silly to conclude (from looking at the implementations of others) that all 6NF databases perform badly, are hard to maintain; disasters. As always, with any technical endeavour, the resulting performance and ease of maintenance are strictly dependent on formality, structure, and adherence to standards, in addition to the relevant skill set.

4 Entity Attribute Value

Disclosure: Experience. I have inspected a few of these, mostly hospital and medical systems. I have performed corrective assignments on two of them. The initial delivery by the overseas provider was quite adequate, although not great, but the extensions implemented by the local provider were a mess. But not nearly the disaster that people have posted about re EAV on this site. A few months intense work fixed them up nicely.

4.1 What It Is

It was obvious to me that the EAV implementations I have worked on are merely subsets of Sixth Normal Form. Those who implement EAV do so because they want some of the features of 6NF (eg. ability to add columns without DDL changes), but they do not have the academic knowledge to implement true 6NF, or the standards and structures to implement and administer it securely. Even the original provider did not know about 6NF, or that EAV was a subset of 6NF, but they readily agreed when I pointed it out to them. Because the structures required to provide EAV, and indeed 6NF, efficiently and effectively (catalogue; Views; automated code generation) are not formally identified in the EAV community, and are missing from most implementations, I classify EAV as the bastard son Sixth Normal Form.

4.2 What, Exactly, about EAV, is "Bad" ?

Going by the comments in this and other threads, yes, EAV done badly is a disaster. More important (a) they are so bad that the performance provided at 5NF (forget 6NF) is lost and (b) the ordinary isolation from the complexity has not been implemented (coders and users are "forced" to use cumbersome navigation). And if they did not implement a catalogue, all sorts of preventable errors will not have been prevented.

All that may well be true for bad (EAV or other) implementations, but it has nothing to do with 6NF or EAV. The two projects I worked had quite adequate performance (sure, it could be improved; but there was no bad performance due to EAV), and good isolation of complexity. Of course, they were nowhere near the quality or performance of my 5NF databases or my true 6NF database, but they were fair enough, given the level of understanding of the posted issues within the EAV community. They were not the disasters and sub-standard nonsense alleged to be EAV in these pages.

5 Nulls

There is a well-known and documented issue called The Null Problem. It is worthy of an essay by itself. For this post, suffice to say:

  • the problem is really the optional or missing value; here the consideration is table design such that there are no Nulls vs Nullable columns
  • actually it does not matter because, regardless of whether you use Nulls/No Nulls/6NF to exclude missing values, you will have to code for that, the problem precisely then, is handling missing values, which cannot be circumvented
    • except of course for pure 6NF, which eliminates the Null Problem
    • the coding to handle missing values remains
      • except, with automated generation of SQL code, heh heh
  • Nulls are bad news for performance, and many of us have decided decades ago not to allow Nulls in the database (Nulls in passed parameters and result sets, to indicate missing values, is fine)
    • which means a set of Null Substitutes and boolean columns to indicate missing values
  • Nulls cause otherwise fixed len columns to be variable len; variable len columns should never be used in indices, because a little 'unpacking' has to be performed on every access of every index entry, during traversal or dive.

6 Position

I am not a proponent of EAV or 6NF, I am a proponent of quality and standards. My position is:

  1. Always, in all ways, do whatever you are doing to the highest standard that you are aware of.

  2. Normalising to Third Normal Form is minimal for a Relational Database (5NF for me). DataTypes, Declarative referential Integrity, Transactions, Normalisation are all essential requirements of a database; if they are missing, it is not a database.

    • if you have to "denormalise for performance", you have made serious Normalisation errors, your design in not normalised. Period. Do not "denormalise", on the contrary, learn Normalisation and Normalise.
  3. There is no need to do extra work. If your requirement can be fulfilled with 5NF, do not implement more. If you need Optional Values or ability to add columns without DDL changes or the complete elimination of the Null Problem, implement 6NF, only in those tables that need them.

  4. If you do that, due only to the fact that SQL does not provide proper support for 6NF, you will need to implement:

    • a simple and effective catalogue (column mix-ups and data integrity loss are simply not acceptable)
    • 5NF access for the 6NF tables, via VIEWS, to isolate the users (and developers) from the encumbered (not "complex") SQL
    • write or buy utilities, so that you can generate the cumbersome SQL to construct the 5NF rows from the 6NF tables, and avoid writing same
    • measure, monitor, diagnose, and improve. If you have a performance problem, you have made either (a) a Normalisation error or (b) a coding error. Period. Back up a few steps and fix it.
  5. If you decide to go with EAV, recognise it for what it is, 6NF, and implement it properly, as above. If you do, you will have a successful project, guaranteed. If you do not, you will have a dog's breakfast, guaranteed.

6.1 There Ain't No Such Thing As A Free Lunch

That adage has been referred to, but actually it has been misused. The way it actually, deeply applies is as above: if you want the benefits of 6NF/EAV, you had better be willing too do the work required to obtain it (catalogue, standards). Of course, the corollary is, if you don't do the work, you won't get the benefit. There is no "loss" of Datatypes; value Domains; Foreign keys; Checks; Rules. Regarding performance, there is no performance penalty for 6NF/EAV, but there is always a substantial performance penalty for sub-standard work.

7 Specific Question

Finally. With due consideration to the context above, and that it is a small project with a small team, there is no question:

  • Do not use EAV (or 6NF for that matter)

  • Do not use Nulls or Nullable columns (unless you wish to subvert performance)

  • Do use a single Payment table for the common payment columns

  • and a child table for each PaymentType, each with its specific columns

  • All fully typecast and constrained.

  • What's this "another row_id" business ? Why do some of you stick an ID on everything that moves, without checking if it is a deer or an eagle ? No. The child is a dependent child. The Relation is 1::1. The PK of the child is the PK of the parent, the common Payment table. This is an ordinary Supertype-Subtype cluster, the Differentiator is PaymentTypeCode. Subtypes and supertypes are an ordinary part of the Relational Model, and fully catered for in the database, as well as in any good modelling tool.

    Sure, people who have no knowledge of Relational databases think they invented it 30 years later, and give it funny new names. Or worse, they knowingly re-label it and claim it as their own. Until some poor sod, with a bit of education and professional pride, exposes the ignorance or the fraud. I do not know which one it is, but it is one of them; I am just stating facts, which are easy to confirm.

A. Responses to Comments

A.1 Attribution

  1. I do not have personal or private or special definitions. All statements regarding the definition (such as imperatives) of:
    • Normalisation,
    • Normal Forms, and
    • the Relational Model.
      .
      refer to the many original texts By EF Codd and CJ Date (not available free on the web)
      .
      The latest being Temporal Data and The Relational Model by CJ Date, Hugh Darwen, Nikos A Lorentzos
      .
      and nothing but those texts
      .
      "I stand on the shoulders of giants"
      .
  2. The essence, the body, all statements regarding the implementation (eg. subjective, and first person) of the above are based on experience; implementing the above principles and concepts, as a commercial organisation (salaried consultant or running a consultancy), in large financial institutions in America and Australia, over 32 years.
    • This includes scores of large assignments correcting or replacing sub-standard or non-relational implementations.
      .
  3. The Null Problem vis-a-vis Sixth Normal Form
    A freely available White Paper relating to the title (it does not define The Null Problem alone) can be found at:
    http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf.
    .
    A 'nutshell' definition of 6NF (meaningful to those experienced with the other NFs), can be found on p6

A.2 Supporting Evidence

  1. As stated at the outset, the purpose of this post is to counter the misinformation that is rife in this community, as a service to the community.

  2. Evidence supporting statements made re the implementation of the above principles, can be provided, if and when specific statements are identified; and to the same degree that the incorrect statements posted by others, to which this post is a response, is likewise evidenced. If there is going to be a bun fight, let's make sure the playing field is level

  3. Here are a few docs that I can lay my hands on immediately.

    a. Large Bank
    This is the best example, as it was undertaken for explicitly the reasons in this post, and goals were realised. They had a budget for Sybase IQ (DW product) but the reports were so fast when we finished the project, they did not need it. The trade analytical stats were my 5NF plus pivoting extensions which turned out to be 6NF, described above. I think all the questions asked in the comments have been answered in the doc, except:
    - number of rows:
    - old database is unknown, but it can be extrapolated from the other stats
    - new database = 20 tables over 100M, 4 tables over 10B.

    b. Small Financial Institute Part A
    Part B - The meat
    Part C - Referenced Diagrams
    Part D - Appendix, Audit of Indices Before/After (1 line per Index)
    Note four docs; the fourth only for those who wish to inspect detailed Index changes. They were running a 3rd party app that could not be changed because the local supplier was out of business, plus 120% extensions which they could, but did not want to, change. We were called in because they upgraded to a new version of Sybase, which was much faster, which shifted the various performance thresholds, which caused large no of deadlocks. Here we Normalised absolutely everything in the server except the db model, with the goal (guaranteed beforehand) of eliminating deadlocks (sorry, I am not going to explain that here: people who argue about the "denormalisation" issue, will be in a pink fit about this one). It included a reversal of "splitting tables into an archive db for performance", which is the subject of another post (yes, the new single table performed faster than the two spilt ones). This exercise applies to MS SQL Server [insert rewrite version] as well.

    c. Yale New Haven Hospital
    That's Yale School of Medicine, their teaching hospital. This is a third-party app on top of Sybase. The problem with stats is, 80% of the time they were collecting snapshots at nominated test times only, but no consistent history, so there is no "before image" to compare our new consistent stats with. I do not know of any other company who can get Unix and Sybase internal stats on the same graphs, in an automated manner. Now the network is the threshold (which is a Good Thing).

倾城花音 2024-10-06 10:10:34

也许你应该看看这个问题

Bill Karwin 接受的答案涉及针对键/值表(通常称为实体)的特定参数属性值 (EAV)

..虽然很多人似乎都赞成
EAV,我不知道。看来是最
灵活的解决方案,因此
最好的。不过,请记住这句格言
TANSTAAFL。以下是一些
EAV 的缺点:

  • 无法强制指定列(相当于 NOT NULL)。
  • 无法使用 SQL 数据类型来验证条目。
  • 无法确保属性名称拼写一致。
  • 无法在任何给定属性的值上放置外键,例如
    查找表。
  • 在传统的表格布局中获取结果非常复杂并且
    昂贵,因为要获得属性
    从你需要做的多行中
    每个属性的 JOIN

EAV 提供的灵活性程度
你需要在其他方面做出牺牲
区域,可能使您的代码为
比原本复杂(或更糟)
是为了解决原来的问题
更传统的方式。

在大多数情况下,这是不必要的
拥有这种程度的灵活性。
在OP关于产品的问题中
类型,创建一个更简单
每个产品类型的表
产品特定的属性,让您
有一些一致的结构
至少对以下条目强制执行
相同的产品类型。

仅当每一行都必须时,我才会使用 EAV
被允许潜在地拥有
不同的属性集。当你
有一组有限的产品类型,
EAV 太过分了。等级表
继承将是我的第一选择。

Perhaps you should look this question

The accepted answer from Bill Karwin goes into specific arguments against the key/value table usually know as Entity Attribute Value (EAV)

.. Although many people seem to favor
EAV, I don't. It seems like the most
flexible solution, and therefore the
best. However, keep in mind the adage
TANSTAAFL. Here are some of the
disadvantages of EAV:

  • No way to make a column mandatory (equivalent of NOT NULL).
  • No way to use SQL data types to validate entries.
  • No way to ensure that attribute names are spelled consistently.
  • No way to put a foreign key on the values of any given attribute, e.g.
    for a lookup table.
  • Fetching results in a conventional tabular layout is complex and
    expensive, because to get attributes
    from multiple rows you need to do
    JOIN for each attribute.

The degree of flexibility EAV gives
you requires sacrifices in other
areas, probably making your code as
complex (or worse) than it would have
been to solve the original problem in
a more conventional way.

And in most cases, it's an unnecessary
to have that degree of flexibility.
In the OP's question about product
types, it's much simpler to create a
table per product type for
product-specific attributes, so you
have some consistent structure
enforced at least for entries of the
same product type.

I'd use EAV only if every row must
be permitted to potentially have a
distinct set of attributes. When you
have a finite set of product types,
EAV is overkill. Class Table
Inheritance would be my first choice.

眼藏柔 2024-10-06 10:10:34

我的第一条原则是不要无缘无故地重新设计某些东西。所以我会选择选项 1,因为这是您当前的设计,并且它具有经过验证的工作记录。

将重新设计的时间花在新功能上。

My #1 principle is not to redesign something for no reason. So I would go with option 1 because that's your current design and it has a proven track record of working.

Spend the redesign time on new features instead.

佞臣 2024-10-06 10:10:34

如果我从头开始设计,我会选择第二个。它为您提供所需的灵活性。然而,随着第 1 号已经就位并正在工作,并且这对整个应用程序来说相当重要,如果没有很好地了解到底是什么查询、存储过程、视图、UDF、报告、导入,我可能会对进行重大设计更改持谨慎态度。等等你必须改变。如果这是我可以以相对较低的风险(并且已经到位的良好测试)完成的事情。我可能会更改解决方案 2,否则您可能会引入新的更严重的错误。

在任何情况下我都不会使用 EAV 表来做这样的事情。它们的查询和性能都很糟糕,而且灵活性被高估了(询问用户是否愿意每年添加 3-4 次新类型,而无需以日常性能为代价更改程序)。

If I were designing from scratch I would go with number two. It gives you the flexibility you need. However with number 1 already in place and working and this being soemting rather central to your whole app, i would probably be wary of making a major design change without a good idea of exactly what queries, stored procs, views, UDFs, reports, imports etc you would have to change. If it was something I could do with a relatively low risk (and agood testing alrady in place.) I might go for the change to solution 2 otherwise you might beintroducing new worse bugs.

Under no circumstances would I use an EAV table for something like this. They are horrible for querying and performance and the flexibility is way overrated (ask users if they prefer to be able to add new types 3-4 times a year without a program change at the cost of everyday performance).

蒗幽 2024-10-06 10:10:34

乍一看,我会选择选项 2(或 3):如果可能,进行概括。
我认为选项 4 不是很关系,并且会使您的查询变得复杂。
当面对这些问题时,我通常会用“用例”来面对这些选项:
- 设计 2/3 在执行此或此操作时表现如何?

At first sight, I would go for option 2 (or 3): when possible, generalize.
Option 4 is not very Relational I think, and will make your queries complex.
When confronted to those question, I generally confront those options with "use cases":
-how is design 2/3 behaving when do this or this operation ?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文