SQL 连接与单表:性能差异?
我试图坚持保持数据库规范化的做法,但这导致需要运行多个联接查询。 如果许多查询使用联接而不是调用可能包含冗余数据的单个表,是否会导致性能下降?
I am trying to stick to the practice of keeping the database normalized, but that leads to the need to run multiple join queries. Is there a performance degradation if many queries use joins vs having a call to a single table that might contain redundant data?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
保持数据库标准化,直到发现瓶颈。 然后只有在仔细分析之后才可以进行非规范化。
在大多数情况下,拥有良好的索引覆盖集和最新的统计数据将解决大多数性能和阻塞问题,而无需任何非规范化。
如果对单个表进行写入和读取,则使用单个表可能会导致性能下降。
Keep the Database normalised UNTIL you have discovered a bottleneck. Then only after careful profiling should you denormalise.
In most instances, having a good covering set of indexes and up to date statistics will solve most performance and blocking issues without any denormalisation.
Using a single table could lead to worse performance if there are writes as well as reads against it.
迈克尔·杰克逊(不是那个)众所周知曾说过、
那可能是在 RDBMS 出现之前,但我认为他会扩展规则以将它们包括在内。
规范化数据模型几乎总是需要多表 SELECT; 正如此类问题的常见情况,“非规范化”的“正确”答案是什么? 问题取决于几个因素。
数据库管理系统平台。
多表查询与单表查询的相对性能受到应用程序所在平台的影响:查询优化器的复杂程度可能会有所不同。 例如,根据我的经验,MySQL 在单表查询上速度快得惊人,但在优化多连接查询时却表现不佳。 对于较小的表(例如,少于 10K 行),这不是一个真正的问题,但对于大型表(10M+)来说,这确实是个问题。
数据量
除非您正在查看 100K+ 行区域中的表,否则几乎不应该有问题。 如果您正在查看数百行的表大小,我什至不会费心考虑索引。
(反)规范化 规范化
的全部目的是尽量减少重复,尽量确保任何必须更新的字段值只需要在一个地方进行更改。 非规范化打破了这一点,如果重复数据的更新很少(理想情况下它们永远不应该发生),那么这并不是什么大问题。 因此,在复制除最静态数据之外的任何内容之前,请务必仔细考虑,请注意,您的数据库可能会显着增长。
要求/限制
您想要满足哪些性能要求? 您有固定的硬件或预算吗? 有时,通过硬件升级可以最轻松、甚至最便宜地实现性能提升。 您预计交易量是多少? 小企业会计系统的概况与 Twitter 等系统截然不同。
最后一个想法让我震惊:如果你足够非规范化,你的数据库与平面文件有什么不同? SQL 非常适合灵活的数据和多维检索,但它可能比直接顺序或相当简单的索引文件慢一个数量级(至少)。
Michael Jackson (not that one) is famously believed to have said,
That was probably before RDBMSs were around, but I think he'd have extended the Rules to include them.
Multi-table SELECTs are almost always needed with a normalised data model; as is often the case with this kind of question, the "correct" answer to the "denormalise?" question depends on several factors.
DBMS platform.
The relative performance of multi- vs single-table queries is influenced by the platform on which your application lives: the level of sophistication of the query optimisers can vary. MySQL, for example, in my experience, is screamingly fast on single-table queries but doesn't optimise queries with multiple joins so well. This isn't a real issue with smaller tables (less than 10K rows, say) but really hurts with large (10M+) ones.
Data volume
Unless you're looking at tables in the 100K+ row region, there pretty much shouldn't be a problem. If you're looking at table sizes in the hundreds of rows, I wouldn't even bother thinking about indexing.
(De-)normalisation
The whole point of normalisation is to minimise duplication, to try to ensure that any field value that must be updated need only be changed in one place. Denormalisation breaks that, which isn't much of a problem if updates to the duplicated data are rare (ideally they should never occur). So think very carefully before duplicating anything but the most static data, Note that your database may grow significantly
Requirements/Constraints
What performance requirements are you trying to meet? Do you have fixed hardware or a budget? Sometimes a performance boost can be most easily - and even most cheaply - achieved by a hardware upgrade. What transaction volumes are you expecting? A small-business accounting system has a very different profile to, say, Twitter.
One last thought strikes me: if you denormalise enough, how is your database different from a flat file? SQL is superb for flexible data and multi-dimensional retieval, but it can be an order of magnitude (at least) slower than a straight sequential or fairly simply indexed file.
我们将查询优化留给数据库,其原因与我们将代码优化留给编译器的原因相同。
如今,大多数现代 RDBMS 在这方面都表现得相当好。
在您认为非规范化在某些情况下“可以”之前,请考虑一下:通常您并不对每个属性都感兴趣。 因此,从磁盘加载不需要的数据效率很低(通常是数据库效率最低的组件)。 如果您采用非规范化设计且连续存在大量冗余数据,情况可能会更糟。 如果您必须更新所有冗余数据,情况会更糟。 加载一些仅包含感兴趣的列的窄表并将它们连接起来会更有效。 同样,这取决于数据库,因此如果没有分析,您就没有任何线索。
如果您真的担心性能,那么您可能正在谈论可扩展性问题。 在这种情况下,您可能需要查看分片,其中正确的(标准化)架构设计很重要。
We leave query optimisation up to the database for the same reasons we leave code optimisation up to the compiler.
Most modern RDBMSes are pretty good in this respect these days.
Before you think that denormalisation is 'ok' in some cases, consider this: normally you are not interested in every attribute. Therefore loading unneeded data off the disk is inefficient (typically the least efficient component of the database). This can be much worse if you have a denormalised design, with lots of redundant data in a row. Even worse again if you have to then update all that redundant data. It can be much more efficient to load some narrow tables containing only the columns of interest and join them. Again, this depends on the database, so without profiling you have no clue.
If you are really worried about performance, you're probably talking scalability issues. In this case you might want to look at sharding, for which proper (normalised) schema design is important.
理智差异。
Sanity difference.
为了标准化而分解表是有成本的。 该成本包含性能部分。 通过以下方式可以将分解表和连接查询中的数据的性能成本保持在较低水平: 使用良好的 DBMS; 正确设计表格; 正确设计索引; 让优化器完成它的工作; 并调整物理设计的 DBMS 特定功能。
构建实现连接的大型表也是有成本的。 关于标准化的优秀教程概述了更新异常和编程困难方面的成本。 组合表也会带来性能成本。 在许多 DBMS 产品中,将非常大的行加载到内存中比加载较小的行花费更多。 当您编写非常宽的表时,最终会迫使 DBMS 读取非常大的行,而只会丢弃读入内存的大部分数据。 这比标准化更能减慢你的速度。
一般来说,不要随意进行反规范化。 必要时,请使用经过前人测试过的设计规则,即使该规则会导致一些非规范化。 我推荐星型模式作为这样的学科。 它有很多好处。 在很多情况下,标准化设计比星型模式设计效果更好。
学习多套设计原则以及何时使用哪套设计原则是学习成为专家的第二阶段。
There is a cost to decomposing tables for the sake of normalization. There is a performance component to that cost. The performance cost of decomposing tables and joining data in queries can be kept low by: using a good DBMS; designing tables right; designing indexes right; letting the optimizer do its job; and tuning the DBMS specific features of physical design.
There is also a cost to composing large tables that materialize joins. The cost in terms of update anomalies and programming difficulties is outlined in good tutorials on normalization. There is also a performance cost to composing tables. In many DBMS products, loading a very big row into memory costs more than loading a smaller row. When you compose very wide tables, you end up forcing the DBMS to read very big rows, only to discard most of the data read into memory. This can slow you down even more than normalization does.
In general, don't denormalize at random. When necessary, use a design discipline that has been tested by people who went before you, even if that discipline results in some denormalization. I recommend star schema as such a discipline. It has a lot going for it. And there are still plenty of situations where a normalized design works better than a star schema design.
Learning more than one set of design principles and learning when to use which set is the second stage of learning to be an expert.
通过设置正确的索引,您的联接可以非常快速地执行。 使用 SQL Profiler 确定需要创建或更改哪些索引来优化常见查询的性能。 确保为您的数据库设置一个维护计划,每周运行一次(对于变化很大的表,每天运行一次)来更新统计信息和索引。
标准化通常优于将数据保存在多个位置。 在某些情况下,插入/更新不需要快速发生,而选择需要非常快速地发生,在这种情况下,如果不进行标准化,情况可能会更好。 即便如此,也不建议过早优化,因此首先采用标准化结构。
With the proper indexes set up, your joins can perform very quickly. Use SQL Profiler to determine what indexes need to be created or altered to optimize performance of your common queries. Be sure to have a maintenance plan set up for your database to run once a week (or every day for tables that change a lot) that updates your statistics and indexes.
Normalization is normally preferred over keeping data in multiple locations. There are scenarios where insert/update does not need to occur quickly and select need to occur very quickly in which case you could be better off without normalization. Even so, premature optimization is not recommended so go with a normalized structure first.
事实上,通过某些云站点实现的终极超级优化之一是使用少量的、更宽的、功能有限的表来提高效率。 到目前为止,如果您需要大规模扩展,这是一种方法。 但对于任何关系数据库管理系统来说,这都不被认为是理想的做法(但事实并非如此)。
如果您遇到性能问题,那么在进行任何类型的非规范化之前,首先需要做很多事情。
One of the ultimate hyperoptimizations available through some of the cloud sites is, in fact, using a smaller number of wider, limited-capability tables for efficiency. So far in the future if you need to wildly scale, this is one way. But it's not considered desirable practice for any relational dbms (which those aren't).
If you're having performance problems, there are a lot of things to work on first, before any kind of denormalizing.