When and why are database joins expensive?

Published 2024-07-06 13:03:02

I'm doing some research into databases and I'm looking at some limitations of relational DBs.

I'm getting that joins of large tables are very expensive, but I'm not completely sure why. What does the DBMS need to do to execute a join operation, and where is the bottleneck?
How can denormalization help to overcome this expense? How do other optimization techniques (indexing, for example) help?

Personal experiences are welcome! If you're going to post links to resources, please avoid Wikipedia. I know where to find that already.

In relation to this, I'm wondering about the denormalized approaches used by cloud service databases like BigTable and SimpleDB. See this question.

Comments (7)

沉溺在你眼里的海 2024-07-13 13:03:02

Denormalising to improve performance? It sounds convincing, but it doesn't hold water.

Chris Date, who in company with Dr Ted Codd was the original proponent of the relational data model, ran out of patience with misinformed arguments against normalisation and systematically demolished them using scientific method: he got large databases and tested these assertions.

I think he wrote it up in Relational Database Writings 1988-1991 but this book was later rolled into edition six of Introduction to Database Systems, which is the definitive text on database theory and design, in its eighth edition as I write and likely to remain in print for decades to come. Chris Date was an expert in this field when most of us were still running around barefoot.

He found that:

  • Some of them hold for special cases
  • All of them fail to pay off for general use
  • All of them are significantly worse for other special cases

It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.

Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.

Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.

It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.

  • There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
  • There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)
  • A type coercion is required before the columns can be compared (WTF?! fix it or go home) SEE END NOTES FOR ADO.NET ISSUE
  • One of the arguments of the comparison is an expression (no index)

Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains a significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.

If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerates the spread of Date's findings.

For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.

You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.

I'd also like to respond to

Joins are just cartesian products with some lipgloss

What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as "cartesian products to which predicates apply" only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.

The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B
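
To make the contrast concrete, a minimal sketch (A, B and the column names are placeholders, not anything from the thread): the first statement supplies no predicate and so asks for the cartesian product; the second supplies a join predicate, which is exactly what the optimiser uses to prune before anything is materialised.

    -- No predicate: every row of A paired with every row of B - the cartesian product.
    SELECT * FROM A, B;

    -- With a join predicate: the optimiser never builds the product; it uses the predicate
    -- (and any index on the join columns) to restrict the result before materialising rows.
    SELECT *
    FROM   A
    JOIN   B ON B.a_id = A.id;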


Notes


David Aldridge provides some important additional information.

There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.

A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.
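
A minimal sketch of that advice (hypothetical tables, roughly ANSI syntax - adjust types per engine): declaring the foreign key does not, in most engines, create an index on the referencing column, so add one explicitly.

    CREATE TABLE customers (
        customer_id int          PRIMARY KEY,
        name        varchar(100) NOT NULL
    );

    CREATE TABLE orders (
        order_id    int  PRIMARY KEY,
        customer_id int  NOT NULL REFERENCES customers (customer_id),
        order_date  date NOT NULL
    );

    -- The FK constraint alone does not index orders.customer_id in SQL Server, Oracle or
    -- PostgreSQL; this index is what makes an index-based join strategy available.
    CREATE INDEX ix_orders_customer_id ON orders (customer_id);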

I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.


"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.

The reason that I responded so savagely was that the statement as worded says that

Joins are cartesian products...

This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.

I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.


The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The c term is a fixed overhead mostly made of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.
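
A rough worked example of that cut-off, with every constant invented purely to make the arithmetic concrete (they are not measurements of any engine): compare the indexed cost k\log_2 n + c against a linear scan costing a\,n.

    k\log_2 n + c \quad\text{vs.}\quad a\,n, \qquad \text{with hypothetical } k = 1,\; c = 20,\; a = 0.1
    n = 100:\quad 20 + \log_2 100 \approx 26.6 \;>\; 10 = 0.1 \times 100 \quad (\text{scan wins})
    n = 1000:\quad 20 + \log_2 1000 \approx 30.0 \;<\; 100 = 0.1 \times 1000 \quad (\text{index wins})
    20 + \log_2 n = 0.1\,n \;\Rightarrow\; n \approx 280

With these made-up constants the break-even lands in the high two-hundreds, which is the shape of the "not until n is in the hundreds" claim above.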


Sometimes denormalisation is a good idea

Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.

You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.

A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).

An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.

Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?

Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.
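
A minimal, SQL Server-flavoured sketch of that kind of rollup (all table and column names are made up): the periodic warehouse load rebuilds a summary table from the normalised source, so the expensive aggregation is paid once per load rather than once per report.

    CREATE TABLE monthly_sales_totals (
        sales_year   int           NOT NULL,
        sales_month  int           NOT NULL,
        customer_id  int           NOT NULL,
        total_amount decimal(18,2) NOT NULL,
        PRIMARY KEY (sales_year, sales_month, customer_id)
    );

    -- Periodic bulk refresh from the normalised transactional source.
    TRUNCATE TABLE monthly_sales_totals;
    INSERT INTO monthly_sales_totals (sales_year, sales_month, customer_id, total_amount)
    SELECT YEAR(order_date), MONTH(order_date), customer_id, SUM(amount)
    FROM   orders
    GROUP BY YEAR(order_date), MONTH(order_date), customer_id;

    -- Reports hit the small summary table, not the millions of detail rows behind it.
    SELECT sales_year, sales_month, SUM(total_amount) AS monthly_total
    FROM   monthly_sales_totals
    GROUP BY sales_year, sales_month;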


ADO.NET problem with type mismatches

Suppose you have a SQL Server table containing an indexed column of type varchar, and you use AddWithValue to pass a parameter constraining a query on this column. C# strings are Unicode, so the inferred parameter type will be NVARCHAR, which doesn't match VARCHAR.

VARCHAR to NVARCHAR is a widening conversion so it happens implicitly - but say goodbye to indexing, and good luck working out why.
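
A T-SQL sketch of that effect (hypothetical table, column and parameter names). Whether the mismatch costs you a full scan or merely an implicit conversion depends on the column's collation, but typing the parameter to match the column sidesteps the question entirely.

    CREATE TABLE dbo.Customers (
        CustomerCode varchar(20)  NOT NULL,
        Name         varchar(100) NOT NULL
    );
    CREATE INDEX IX_Customers_CustomerCode ON dbo.Customers (CustomerCode);

    -- What AddWithValue infers from a C# string: an NVARCHAR parameter. The VARCHAR column
    -- is implicitly converted, which (depending on collation) can prevent an index seek.
    DECLARE @code_nv nvarchar(20) = N'ABC123';
    SELECT Name FROM dbo.Customers WHERE CustomerCode = @code_nv;

    -- Parameter typed to match the column: a straightforward index seek.
    DECLARE @code_v varchar(20) = 'ABC123';
    SELECT Name FROM dbo.Customers WHERE CustomerCode = @code_v;

On the C# side the usual remedy is to specify the parameter type explicitly (for example with SqlDbType.VarChar) rather than relying on AddWithValue's inference.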


"Count the disk hits" (Rick James)

If everything is cached in RAM, JOINs are rather cheap. That is, normalization does not have much performance penalty.

If a "normalized" schema causes JOINs to hit the disk a lot, but the equivalent "denormalized" schema would not have to hit the disk, then denormalization wins a performance competition.

Comment from original author: Modern database engines are very good at organising access sequencing to minimise cache misses during join operations. The above, while true, might be misconstrued as implying that joins are necessarily problematically expensive on large data, which could lead to poor decision-making on the part of inexperienced developers.

柏林苍穹下 2024-07-13 13:03:02

What most commenters fail to note is the wide range of join methodologies available in a complex RDBMS, and the denormalisers invariably gloss over the higher cost of maintaining denormalised data. Not every join is based on indexes, and databases have a lot of optimised algorithms and methodologies for joining that are intended to reduce join costs.

In any case, the cost of a join depends on its type and a few other factors. It needn't be expensive at all - some examples.

  • A hash join, in which bulk data is equijoined, is very cheap indeed, and the cost only becomes significant if the hash table cannot be cached in memory. No index required. Equi-partitioning between the joined data sets can be a great help.
  • The cost of a sort-merge join is driven by the cost of the sort rather than the merge -- an index-based access method can virtually eliminate the cost of the sort.
  • The cost of a nested loop join on an index is driven by the height of the B-tree index and the access of the table block itself. It's fast, but not suitable for bulk joins.
  • A nested loop join based on a cluster is much cheaper, with fewer logical IOs required per join row -- if the joined tables are both in the same cluster then the join becomes very cheap through the colocation of joined rows.

Databases are designed to join, and they're very flexible in how they do it and generally very performant unless they get the join mechanism wrong.
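
To see those methods side by side, a hedged sketch using SQL Server's OPTION query hints on a hypothetical orders/customers pair; the optimiser normally picks the method itself from cardinality estimates, and the hints are only there to make the alternatives visible in the plan.

    -- Hash join: bulk equijoin; builds a hash table on the smaller input, no index required.
    SELECT o.order_id, c.name
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    OPTION (HASH JOIN);

    -- Merge join: cheap when both inputs are already ordered on the join key (e.g. via an index).
    SELECT o.order_id, c.name
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    OPTION (MERGE JOIN);

    -- Nested loops: fast when a small outer input probes an indexed inner table; not for bulk joins.
    SELECT o.order_id, c.name
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    OPTION (LOOP JOIN);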

再见回来 2024-07-13 13:03:02

I think the whole question is based on a false premise. Joins on large tables are not necessarily expensive. In fact, doing joins efficiently is one of the main reasons relational databases exist at all. Joins on large sets often are expensive, but very rarely do you want to join the entire contents of large table A with the entire contents of large table B. Instead, you write the query such that only the important rows of each table are used and the actual set kept by the join remains smaller.
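
A small illustration of that point (invented schema, SQL Server-flavoured date functions): the query never asks for all of orders joined to all of customers; the restriction on an indexed date column keeps the set carried through the join small.

    SELECT c.name, o.order_id, o.amount
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    WHERE  o.order_date >= DATEADD(DAY, -7, CAST(GETDATE() AS date))  -- selective, indexable restriction
      AND  c.region = 'EMEA';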

Additionally, you have the efficiencies mentioned by Peter Wone, such that only the important parts of each record need be in memory until the final result set is materialized. Also, in large queries with many joins you typically want to start with the smaller table sets and work your way up to the large ones, so that the set kept in memory remains as small as possible as long as possible.

When done properly, joins are generally the best way to compare, combine, or filter on large amounts of data.

眼眸印温柔 2024-07-13 13:03:02

The bottleneck is pretty much always disk I/O, and even more specifically - random disk I/O (by comparison, sequential reads are fairly fast and can be cached with read ahead strategies).

Joins can increase random seeks - if you're jumping around reading small parts of a large table. But, query optimizers look for that and will turn it into a sequential table scan (discarding the unneeded rows) if it thinks that'd be better.

A single denormalized table has a similar problem - the rows are large, so fewer fit on a single data page. If you need rows that are located far from one another (and the large row size pushes them further apart) then you'll have more random I/O. Again, a table scan may be forced to avoid this. But, this time, your table scan has to read more data because of the large row size. Add to that the fact that you're copying data from a single location to multiple locations, and the RDBMS has that much more to read (and cache).

With 2 tables, you also get 2 clustered indexes - and can generally index more (because of less insert/update overhead), which can get you drastically increased performance (mainly, again, because indexes are (relatively) small, quick to read off disk (or cheap to cache), and reduce the number of table rows you need to read from disk).

About the only overhead with a join comes from figuring out the matching rows. Sql Server uses 3 different types of joins, mainly based on dataset sizes, to find matching rows. If the optimizer picks the wrong join type (due to inaccurate statistics, inadequate indexes, or just an optimizer bug or edge case) it can drastically affect query times.

  • A loop join is fairly cheap for (at least one) small dataset.
  • A merge join requires a sort of both datasets first. If you join on an indexed column, though, then the index is already sorted and no further work needs to be done. Otherwise, there is some CPU and memory overhead in sorting.
  • The hash join requires both memory (to store the hashtable) and CPU (to build the hash). Again, this is fairly quick in relation to the disk I/O. However, if there's not enough RAM to store the hashtable, Sql Server will use tempdb to store parts of the hashtable and the found rows, and then process only parts of the hashtable at a time. As with all things disk, this is fairly slow.

In the optimal case, these cause no disk I/O - and so are negligible from a performance perspective.

All in all, at worst it should actually be faster to read the same amount of logical data from x joined tables than from a single denormalized table, because of the smaller disk reads. To read the same amount of physical data, there could be some slight overhead.

Since query time is usually dominated by I/O costs, and the size of your data does not change (minus some very minuscule row overhead) with denormalization, there's not a tremendous amount of benefit to be had by just merging tables together. The type of denormalization that tends to increase performance, IME, is caching calculated values instead of reading the 10,000 rows required to calculate them.
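
One way to cache calculated values like that without giving up the normalised base tables is, for example, a SQL Server indexed view - a hedged sketch with hypothetical names; once the unique clustered index exists, the engine maintains the aggregate on every insert/update/delete and reports read it instead of the detail rows.

    -- Schema-bound view over the normalised detail rows (amount assumed NOT NULL).
    CREATE VIEW dbo.CustomerTotals
    WITH SCHEMABINDING
    AS
    SELECT customer_id,
           SUM(amount)  AS total_amount,
           COUNT_BIG(*) AS detail_rows   -- required in an indexed view with GROUP BY
    FROM   dbo.OrderLines
    GROUP BY customer_id;

    -- Materialises the aggregate and makes it self-maintaining from here on.
    CREATE UNIQUE CLUSTERED INDEX IX_CustomerTotals ON dbo.CustomerTotals (customer_id);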

清欢 2024-07-13 13:03:02

The order in which you're joining the tables is extremely important. If you have two sets of data, try to build the query so that the smaller set is used first, to reduce the amount of data the query has to work on.

For some databases it does not matter; for example, MS SQL works out the proper join order by itself most of the time.
For some (like IBM Informix) the order makes all the difference.

离笑几人歌 2024-07-13 13:03:02

Deciding whether to denormalize or normalize is a fairly straightforward process when you consider the complexity class of the join. For instance, I tend to design my databases with normalization when the queries are O(k log n) where k is relative to the desired output magnitude.

An easy way to denormalize and optimize performance is to think about how changes to your normalized structure affect your denormalized structure. It can be problematic, however, as it may require transactional logic to work on a denormalized structure.

The debate over normalization and denormalization isn't going to end, since the problems are vast. There are many problems where the natural solution requires both approaches.

As a general rule, I've always stored a normalized structure and denormalized caches that can be reconstructed. Eventually, these caches have saved my ass when future normalization problems turned up.

2024-07-13 13:03:02

Elaborating on what others have said:

Joins are just cartesian products with some lipgloss. {1,2,3,4} X {1,2,3} would give us 12 combinations (n X m). This computed set acts as a reference on which conditions are applied. The DBMS applies the conditions (like where both left and right are 2 or 3) to give us the matching condition(s). Actually it is more optimised, but the problem is the same. Changes to the size of the sets would increase the result size exponentially, and the amount of memory and CPU cycles consumed is affected in the same way.

When we denormalise, we avoid this computation altogether. Think of having a coloured sticky note attached to every page of your book: you can infer the information without using a reference. The penalty we pay is that we compromise the essence of the DBMS (optimal organisation of data).
