非规范化如何提高数据库性能?
我听说过很多关于非规范化的内容,它是为了提高某些应用程序的性能而进行的。但我从来没有尝试过做任何相关的事情。
所以,我只是好奇,规范化数据库中的哪些地方会使性能变差,或者换句话说,非规范化原则是什么?
如果我需要提高性能,如何使用此技术?
I heard a lot about denormalization which was made to improve performance of certain application. But I've never tried to do anything related.
So, I'm just curious, which places in normalized DB makes performance worse or in other words, what are denormalization principles?
How can I use this technique if I need to improve performance?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
非规范化通常用于:
非规范化的基本思想是添加冗余数据或对一些数据进行分组,以便能够以更小的成本更轻松地获取这些数据;这对表演来说更好。
一个简单的例子?
现在,有一些成本,是的:
Denormalization is generally used to either:
The basic idea of denormalization is that you'll add redundant data, or group some, to be able to get those data more easily -- at a smaller cost; which is better for performances.
A quick examples?
Now, there are some costs, yes:
非规范化是一种时间-空间权衡。规范化数据占用更少的空间,但可能需要连接来构造所需的结果集,因此需要更多时间。如果它是非规范化的,数据会被复制到多个地方。这样会占用更多空间,但可以轻松获得所需的数据视图。
还有其他时空优化,例如
与任何此类方法一样,这可以改进读取数据(因为它们随时可用),但更新数据成本变得更高(因为您需要更新复制或预先计算的数据)。
Denormalization is a time-space trade-off. Normalized data takes less space, but may require join to construct the desired result set, hence more time. If it's denormalized, data are replicated in several places. It then takes more space, but the desired view of the data is readily available.
There are other time-space optimizations, such as
As with any of such approach, this improves reading data (because they are readily available), but updating data becomes more costly (because you need to update the replicated or precomputed data).
“非规范化”这个词会导致设计问题的混乱。试图通过非规范化获得高性能数据库就像试图开车离开纽约到达目的地一样。它不会告诉你该走哪条路。
您需要的是一种良好的设计准则,一种能够产生简单而合理的设计的准则,即使该设计有时与规范化规则相冲突。
星型模式就是这样的一种设计规则。在星型模式中,单个事实表充当星型表的中心。其他表称为维度表,它们位于架构的边缘。维度通过类似于车轮辐条的关系连接到事实表。星型模式基本上是一种将多维设计投射到 SQL 实现上的方法。
与星型模式密切相关的是雪花模式,它稍微复杂一些。
如果您有一个良好的星型模式,您将能够通过不超过三路联接(涉及二维和一个事实表)获得多种数据组合。不仅如此,许多 OLAP 工具将能够自动解读您的星型设计,并为您提供对数据的点击、向下钻取和图形分析访问,无需进一步编程。
星型模式设计有时会违反第二范式和第三范式,但它可以提高报告和摘要的速度和灵活性。它最常用于数据仓库、数据集市和报告数据库。通常,与随意的“非规范化”相比,星型模式或其他一些面向检索的设计会获得更好的结果。
The word "denormalizing" leads to confusion of the design issues. Trying to get a high performance database by denormalizing is like trying to get to your destination by driving away from New York. It doesn't tell you which way to go.
What you need is a good design discipline, one that produces a simple and sound design, even if that design sometimes conflicts with the rules of normalization.
One such design discipline is star schema. In a star schema, a single fact table serves as the hub of a star of tables. The other tables are called dimension tables, and they are at the rim of the schema. The dimensions are connected to the fact table by relationships that look like the spokes of a wheel. Star schema is basically a way of projecting multidimensional design onto an SQL implementation.
Closely related to star schema is snowflake schema, which is a little more complicated.
If you have a good star schema, you will be able to get a huge variety of combinations of your data with no more than a three way join, involving two dimensions and one fact table. Not only that, but many OLAP tools will be able to decipher your star design automatically, and give you point-and-click, drill down, and graphical analysis access to your data with no further programming.
Star schema design occasionally violates second and third normal forms, but it results in more speed and flexibility for reports and extracts. It's most often used in data warehouses, data marts, and reporting databases. You'll generally have much better results from star schema or some other retrieval oriented design, than from just haphazard "denormalization".
非规范化的关键问题是:
最简单的非规范化类型之一是将身份字段填充到表中以避免连接。由于身份永远不会改变,这意味着保持数据同步的问题很少出现。例如,我们将客户端 ID 填充到多个表中,因为我们经常需要由客户端查询它们,并且在查询中不一定需要客户端表和我们正在查询的表之间的表中的任何数据如果数据完全标准化。您仍然需要执行一次联接来获取客户端名称,但这比联接 6 个父表来获取客户端名称要好,因为这是您需要从正在查询的表外部获取的唯一数据。
但是,除非我们经常在需要中间表中的数据的情况下进行查询,否则这样做没有任何好处。
另一种常见的非规范化可能是将名称字段添加到其他表中。由于名称本质上是可变的,因此您需要确保名称与触发器保持同步。但是,如果这可以让您免于连接 5 个表而不是 2 个表,那么稍微长一点的插入或更新的成本是值得的。
The critical issues in denormalizing are:
One of the easiest types of denormalizing is to populate an identity field to tables to avoid a join. As identities should not ever change, this means the issue of keeping the data in sync rarely comes up. For instance, we populate our client id to several tables because we often need to query them by client and do not necessarily need, in the queries, any of the data in the tables that would be between the client table and the table we are querying if the data was totally normalized. You still have to do one join to get the client name, but that is better than joining to 6 parent tables to get the client name when that is the only piece of data you need from outside the table you are querying.
However, there would be no benefit to this unless we were often doing queries where data from the intervening tables was needed.
Another common denormalization might be to add a name field to other tables. As names are inherently changeable, you need to ensure that the names stay in synch with triggers. But if this saves you from joining to 5 tables instead of 2, it can be worth the cost of the slightly longer insert or update.
如果您有某些要求,例如报告等,它可以帮助您以各种方式对数据库进行非规范化:
引入某些数据重复来为自己节省一些连接(例如,将某些信息填充到表中并可以处理重复的数据,因此该表中的所有数据不需要通过连接另一个表来找到)
您可以预先计算某些值并将它们存储在表列中,而不是每次查询数据库时动态计算它们。当然,随着时间的推移,这些计算值可能会变得“陈旧”,您可能需要在某个时候重新计算它们,但仅读出固定值通常比计算某些内容(例如计算子行)要便宜
当然还有更多方法对数据库模式进行非规范化以提高性能,但您只需要意识到这样做确实会给自己带来一定程度的麻烦。在做出这些决定时,您需要仔细权衡利弊 - 性能优势与您遇到的问题。
If you have certain requirement, like reporting etc., it can help to denormalize your database in various ways:
introduce certain data duplication to save yourself some JOINs (e.g. fill certain information into a table and be ok with duplicated data, so that all the data in that table and doesn't need to be found by joining another table)
you can pre-compute certain values and store them in a table column, insteda of computing them on the fly, everytime to query the database. Of course, those computed values might get "stale" over time and you might need to re-compute them at some point, but just reading out a fixed value is typically cheaper than computing something (e.g. counting child rows)
There are certainly more ways to denormalize a database schema to improve performance, but you just need to be aware that you do get yourself into a certain degree of trouble doing so. You need to carefully weigh the pros and cons - the performance benefits vs. the problems you get yourself into - when making those decisions.
考虑一个具有正确规范化父子关系的数据库。
假设基数是平均值 2x1。
您有两个表,Parent,有 p 行。有 2x p 行的子项。
连接操作意味着对于 p 个父行,必须读取 2x p 个子行。读取的总行数为 p + 2x p。
考虑将其非规范化为仅包含子行的单个表,2x p。读取的行数是 2x p。
更少的行 == 更少的物理 I/O == 更快。
Consider a database with a properly normalized parent-child relationship.
Let's say the cardinality is an average of 2x1.
You have two tables, Parent, with p rows. Child with 2x p rows.
The join operation means for p parent rows, 2x p child rows must be read. The total number of rows read is p + 2x p.
Consider denormalizing this into a single table with only the child rows, 2x p. The number of rows read is 2x p.
Fewer rows == less physical I/O == faster.
根据本文的最后一部分,
https: //technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx
人们可以使用虚拟非规范化,您可以在其中创建具有一些非规范化数据的视图可以更快地运行更简单的 SQL 查询,而基础表仍保持规范化以实现更快的添加/更新操作(只要您可以定期而不是实时更新视图)。我自己刚刚上了一门关系数据库课程,但是从我所读到的内容来看,这种方法对我来说似乎是合乎逻辑的。
As per the last section of this article,
https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx
one could use Virtual Denormalization, where you create Views with some denormalized data for running more simplistic SQL queries faster, while the underlying Tables remain normalized for faster add/update operations (so long as you can get away with updating the Views at regular intervals rather than in real-time). I'm just taking a class on Relational Databases myself but, from what I've been reading, this approach seems logical to me.
反规范化相对规范化的优势
基本上,反规范化用于 DBMS,而不是 RDBMS。我们知道,RDBMS 采用标准化工作,这意味着不会一次又一次重复数据。但使用外键时仍然会重复一些数据。
当您使用 DBMS 时,需要删除规范化。为此,需要重复。但仍然提高了性能,因为表之间没有关系,每个表都是不可分割的存在。
Benefits of de-normalization over normalization
Basically de-normalization is used for DBMS not for RDBMS. As we know that RDBMS works with normalization, which means no repeat data again and again. But still repeat some data when you use foreign key.
When you use DBMS then there is a need to remove normalization. For this, there is a need for repetition. But still, it improves performance because there is no relation among the tables and each table has indivisible existence.