Oracle 如何使用统计数据

发布于 2024-08-13 04:25:31 字数 341 浏览 9 评论 0原文

上一个问题中,我得到了有关 Oracle 统计数据的评论:

Oracle 不知道 50M 大于行数。当然,它有统计数据,但它们可能是旧的和错误的 - Oracle 绝不会仅仅因为统计数据错误就允许自己提供不正确的结果

我非常确定 Oracle 在准备查询执行计划时依赖于统计数据。在版本 10 之前,建议不时刷新统计信息,从 10g 开始,Oracle 自动收集统计信息。

有人能解释一下 Oracle 查询分析器在多大程度上依赖于统计数据吗?

In previous question I got comment about Oracle statistics:

Oracle doesn't know that 50M is greater than the number of rows. Sure, it has statistics, but they could be old and wrong - and Oracle would never allow itself to deliver an incorrect result only because the statistics are wrong

I was pretty sure that Oracle relies on statistics when preparing query execution plan. Before version 10 it was recommended to refresh statistics from time to time and from 10g Oracle gathers statistics automatically.

Can somebody explain how much Oracle query analyzer relies on statistics data?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

近箐 2024-08-20 04:25:31

Oracle 大量使用统计信息来生成查询执行计划。它不会(也不应该)做的是以影响查询结果的方式使用这些统计信息,这就是您试图对“ROWNUM < 50000000”执行的操作。统计数据可能已过时或丢失。然而,这仅意味着Oracle生成正确结果的速度可能很慢,并不意味着Oracle将返回不正确的结果。

如果 Oracle 按您希望的方式工作,那么它可能会认为“ROWNUM < 50000000”意味着“获取所有行”,即使该表现在包含 60,000,000 行(但过时的统计数据表明它只包含 49,000,000 行)。幸运的是事实并非如此。

Oracle uses statistics a lot, to generate query execution plans. What it does not (and should not) do is use those statistics in a way that will affect query results, which is what you were trying to do with "ROWNUM < 50000000". The statistics may be out of date, or missing. However, this will only mean that Oracle may be slow to generate the correct result, it does not mean that Oracle will return an incorrect result.

If Oracle worked as you hoped, then it might decide that "ROWNUM < 50000000" meant "get all rows" even though the table now contained 60,000,000 rows (but had out of date stats saying it contained only 49,000,000). Fortunately it does not.

梦萦几度 2024-08-20 04:25:31

统计信息对于查询优化器非常重要。应自动或手动定期收集它们。

执行查询时,Oracle 将生成一个可用执行计划池以满足您的查询。从这些执行计划将返回相同的精确结果的角度来看,这些执行计划是相同的,只是实现这一目标的道路对于一个计划来说可能比另一个计划更有效。为了确定这种效率,Oracle 使用每个执行计划中使用的对象生成的统计信息来确定它们各自的成本。如果这些统计数据不存在或已过时,则与每个计划相关的成本将不太准确,因此可能不会选择最佳计划。

以下是 Oracle 用于确定此成本的一些关键统计数据:

表格统计

 * 行数
 * 块数
 * 平均行长    

列统计

 * 列中不同值 (NDV) 的数量
 * 列中空值的数量
 * 数据分布(直方图)
 * 扩展统计

指数统计

* 叶子块的数量
* 级别
* 聚类因子

系统统计

* I/O 性能和利用率
* CPU性能和利用率

Statistics are VERY important for the query optimizer. They should be gathered on a regular basis either automatically or manually.

When executing a query Oracle will produce a pool of available execution plans in order to satisfy your query. Those execution plans are the same from the standpoint that they will return you the same exact result, it's just the road to getting there may be far more efficient for one plan over another. To determine this efficiency, Oracle uses the stats generated on the objects used in each of the execution plans to determine their individual costs. If those stats are non-existent or are stale, the cost associated with each plan will be less accurate and therefore the optimal plan may not be chosen.

Here are some of the key stats that Oracle uses for determining this cost:

Table statistics

 * Number of rows
 * Number of blocks
 * Average row length    

Column statistics

 * Number of distinct values (NDV) in column
 * Number of nulls in column
 * Data distribution (histogram)
 * Extended statistics

Index statistics

* Number of leaf blocks
* Levels
* Clustering factor

System statistics

* I/O performance and utilization
* CPU performance and utilization
橘虞初梦 2024-08-20 04:25:31

Oracle 基于成本的优化器 (CBO) 使用统计信息来计算执行查询的不同方式的相对成本,以便选择最合适的一种。

总体而言,该方法运行良好,并且正在不断改进。例如,在 11g 中,您可以收集多列直方图,这对在相关列上进行谓词的查询有很大帮助(例如,强相关性,如出生月份和星座,或更弱相关性,如性别和身高)。

然而它并不完美。例如,估计两个表之间联接的结果集的基数是相当准确的,就像估计过滤操作的基数一样,但是将两者结合起来需要大量的估计,而这些估计很容易不准确。在某些情况下,可以通过提示或使用中间结果集的全局临时表来解决这些问题。

统计数据的另一个问题是,更改它们可能会改变执行计划,因此最近出现了更多的运动,要么阻止持续收集统计数据,要么在实施统计数据更改之前分析其影响。

查找乔纳森·刘易斯 (Jonathan Lewis) 的书——它对这个主题进行了非常彻底的处理。

Statistics are used by the oracle cost based optimizer (CBO) to calculate the relative costs of different ways of executing a query so that the most appropriate one can be chosen.

On the whole this works very well, and is being continually improved. For example in 11g you can gather multicolumn histograms that help greatly with queries having predicates on correlated columns (eg. strongly correlated like month of birth and star sign, or more weakly correlted like gender and height).

However it is not perfect. For example estimating the cardinality of the result set of a join between two tables is reasonably accurate, as is estimating the cardinality from a filter operation, but combining the two requires a lot of estimation that can easily be inaccurate. In some cases these issues can be worked around with hints, or with the use of global temporary tables for intermediate result sets.

Another problem of statistics is that changing them can change the execution plan, so there is more of a movement recently to either discourage continual gathering of statistics, or to analyse the impact of changes to statistics before implementing them.

Look for the Jonathan Lewis book -- it is a very thorough treatment of the subject.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文