Oracle SQL 中的索引、EXPLAIN PLAN 和记录访问

发布于 2024-12-16 02:36:07 字数 1319 浏览 3 评论 0原文

我一直在学习 Oracle SQL 中的索引，我想用一个测试表进行一个小实验，看看索引到底是如何工作的。正如我从之前发表的一篇文章中发现的那样，最好的方法是使用 EXPLAIN PLAN。然而，我遇到了一些让我困惑的事情。

我的示例表包含属性（EmpID、Fname、Lname、职业等）。我使用自己编写的 java 程序填充了 500,000 条记录（随机名称、职业等）。现在，这里有一些带索引和不带索引的示例查询：

NO INDEX:

SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

EXPLAIN PLAN 说：

OPERATION                         OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE  ANALYZED  1169

现在我创建索引：

CREATE INDEX occupation_idx
    ON EMPLOYEE (Occupation);

WITH INDEX "ocupation_idx":

SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

EXPLAIN PLAN 说：

OPERATION                         OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE  ANALYZED  1169

那么...成本仍然相同，1169？现在我尝试这样做：

WITH INDEX "ocupation_idx":

SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

EXPLAIN PLAN 说：

OPERATION                              OPTIMIZER COST
INDEX(RANGE SCAN) TEST.OCCUPATION_IDX  ANALYZED  67

因此，看来只有当该列是我从中提取值的唯一列时才会使用索引。但我认为索引的意义在于使用索引列作为密钥来解锁整个记录？上面的搜索是毫无意义的......它搜索您已经知道的值。我能想到的唯一有价值的查询（仅涉及索引列的值（而不是记录的其余部分））是诸如 COUNT 之类的聚合。

我缺少什么？

原文

I have been learning about indexes in Oracle SQL, and I wanted to conduct a small experiment with a test table to see how indexes really worked. As I discovered from an earlier post made here, the best way to do this is with EXPLAIN PLAN. However, I am running into something which confuses me.

My sample table contains attributes (EmpID, Fname, Lname, Occupation, .... etc). I populated it with 500,000 records using a java program I wrote (random names, occupations, etc). Now, here are some sample queries with and without indexes:

NO INDEX:

SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

EXPLAIN PLAN says:

OPERATION                         OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE  ANALYZED  1169

Now I create index:

CREATE INDEX occupation_idx
    ON EMPLOYEE (Occupation);

WITH INDEX "occupation_idx":

SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

EXPLAIN PLAN says:

OPERATION                         OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE  ANALYZED  1169

So... the cost is STILL the same, 1169? Now I try this:

WITH INDEX "occupation_idx":

SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

EXPLAIN PLAN says:

OPERATION                              OPTIMIZER COST
INDEX(RANGE SCAN) TEST.OCCUPATION_IDX  ANALYZED  67

So, it appears that the index only is utilized when that column is the only one I'm pulling values from. But I thought that the point of an index was to unlock the entire record using the indexed column as the key? The search above is a pretty pointless one... it searches for values which you already know. The only worthwhile query I can think of which ONLY involves an indexed column's value (and not the rest of the record) would be an aggregate such as COUNT or something.

What am I missing?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

被你宠の有点坏 2024-12-23 02:36:07

即使有了索引，Oracle 仍决定对第二个查询进行完整扫描。

它为什么这么做？ Oracle 会创建两个计划并给出每个计划的成本：-

1) 全面扫描

2) 索引访问

Oracle 选择成本较低的计划。显然，它提出了全面扫描作为较低的成本。

如果你想查看索引计划的成本，你可以做一个带有这样提示的解释计划来强制使用索引：

SELECT /*+ INDEX(EMPLOYEE occupation_idx) */ Fname
FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

如果你对上面的内容做一个解释计划，你会看到成本大于全部扫描费用。这就是Oracle没有选择使用索引的原因。

考虑索引计划成本的一个简单方法是： -

索引的 blevel（必须从上到下读取多少个块）
随后必须读取索引中匹配的记录的表块的数量。这依赖于 Oracle 对从事“DOCTOR”职业的员工人数的估计。在您的简单示例中，这将是：
行数/不同值的数量

更复杂的考虑因素包括集群工厂和索引成本调整，它们都反映了读取的块已经在内存中，因此不需要读取的可能性从磁盘。

也许您可以使用带有索引提示的查询结果以及此查询的结果来更新您的问题：-

SELECT COUNT(*), COUNT(DISTINCT( Occupation ))
FROM EMPLOYEE;

这将允许人们评论索引计划的成本。

Even with your index, Oracle decided to do a full scan for the second query.

Why did it do this? Oracle would have created two plans and come up with a cost for each:-

1) Full scan

2) Index access

Oracle selected the plan with the lower cost. Obviously it came up with the full scan as the lower cost.

If you want to see the cost of the index plan, you can do an explain plan with a hint like this to force the index usage:

SELECT /*+ INDEX(EMPLOYEE occupation_idx) */ Fname
FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

If you do an explain plan on the above, you will see that the cost is greater than the full scan cost. This is why Oracle did not choose to use the index.

A simple way to consider the cost of the index plan is:-

The blevel of the index (how many blocks must be read from top to bottom)
The number of table blocks that must be subsequently read for records matching in the index. This relies on Oracle's estimate of the number of employees that have an occupation of 'DOCTOR'. In your simple example, this would be:
number of rows / number of distinct values

More complicated considerations include the clustering factory and index cost adjustments which both reflect the likelyhood that a block that is read is already in memory and hence does not need to read from disk.

Perhaps you could update your question with the results from your query with the index hint and also the results of this query:-

SELECT COUNT(*), COUNT(DISTINCT( Occupation ))
FROM EMPLOYEE;

This will allow people to comment on the cost of the index plan.

回复收藏 0 原文

﹉夏雨初晴づ 2024-12-23 02:36:07

我想我明白这里发生了什么。

当您拥有适当的索引并且您执行以下操作时：

SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

执行计划将使用该索引。这是显而易见的，因为满足查询所需的所有数据都在索引中，Oracle 甚至根本不需要引用该表。

但是，当您这样做时：

SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

那么，如果 Oracle 使用索引，它将执行 INDEX RANGE SCAN，然后执行 TABLE ACCESS BY ROWID 来查找与该职业相对应的 Fname。现在，根据 DOCTOR 占用的行数，Oracle 将必须对表进行一次或多次访问，以查找 Fname。例如，如果您有一个表，并且所有员工的职业都设置为“DOCTOR”，则索引没有多大用处，Oracle 将简单地对该表进行全表扫描。如果有 10,000 名员工，并且只有一名是 DOCTOR，那么同样，这是理所当然的，Oracle 将使用该索引。

但当你处于这两个极端之间时，就会有一些微妙之处。人们在讨论是否使用索引时喜欢谈论“选择性”，即索引识别了多少行，与表的大小。但是，事实并非如此。 Oracle真正关心的是块选择性。也就是说，它必须访问多少块才能满足查询？那么，首先，范围扫描有多“宽”？谓词值指定的值范围越有限越好。其次，当您的查询需要进行表查找时，它必须访问多少个不同的块才能找到所需的所有数据。也就是说，表中的数据相对于索引顺序的“随机”程度如何？这称为 CLUSTERING_FACTOR。如果您分析索引以收集统计信息，然后查看 USER_INDEXES，您将看到 CLUSTERING_FACTOR 现已填充。

那么，什么是 CLUSTERING_FACTOR？ CLUSTERING_FACTOR 是表相对于索引的键列的“有序性”。 CLUSTERING_FACTOR 的值始终介于表中的块数和表中的行数之间。低 CLUSTERING_FACTOR，即非常接近表中块的数量，表示表相对于索引非常有序。高 CLUSTERING_FACTOR，即非常接近表中行数的 CLUSTERING_FACTOR，相对于索引来说是非常无序的。

CLUSTERING_FACTOR 描述了表中数据相对于索引的顺序，这是一个需要理解的重要概念。因此，例如，重建索引将不会更改CLUSTERING_FACTOR。同样重要的是要理解，同一个表可能有两个索引，一个可能具有出色的 CLUSTERING_FACTOR，而另一个可能具有极差的 CLUSTERING_FACTOR。桌子本身只能以一种方式订购。

那么，为什么我要花这么多时间来描述 CLUSTERING_FACTOR？因为当您有一个执行计划，先执行 INDEX RANGE SCAN，然后执行 TABLE ACCESS BY ROWID，您可以确定 Oracle 优化器已经考虑了 CLUSTERING_FACTOR，以得出执行计划。例如，假设您有一个包含 10,000 行的表，并且其中 100 行的 Occupation = 'DOCTOR'。您在上面编写了查询，询问职业为 DOCTOR 的员工的 Fname。那么，Oracle 可以非常轻松高效地确定占用为 DOCTOR 的行的 rowid。但是，Oracle 需要访问多少表块来进行 Fname 查找？如果数据按表中的职业聚集（排序），则可能只有 1 或 2 个表块。但是，如果表中的数据非常无序，则可能多达 100 个！因此，同样，有 10,000 行表，并且我们假设（为了说明和简单数学的目的）该表有 100 行/块，因此有 100 个块。根据表顺序（即 CLUSTERING_FACTOR），表块访问次数可能少至 1，多至 100。

因此，我希望这可以帮助您理解为什么优化器在某些情况下可能不愿意使用索引。

I think I see what's happening here.

When you have the index in place, and you do:

SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

The execution plan will use the index. This is a no-brainer, cause all the data that's required to satisfy the query is right there in the index, and Oracle never even has to reference the table at all.

However, when you do:

SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';

then, if Oracle uses the index, it will do an INDEX RANGE SCAN followed by a TABLE ACCESS BY ROWID to look up the Fname that corresponds to that Occupation. Now, depending on how many rows have DOCTOR for Occupation, Oracle will have to make one or more trips to the table, to look up the Fname. If, for example, you have a table, and all the employees have Occupation set to 'DOCTOR', the index isn't of much use, and Oracle will simply do a FULL TABLE SCAN of the table. If there are 10,000 employees, and only one is a DOCTOR, then again, it's a no-brainer, and Oracle will use the index.

But there are some subtleties, when you're somewhere between those two extremes. People like to talk about 'selectivity', i.e., how many rows are identifed by the index, vs. the size of the table, when discussing whether the index will be used. But, that's not really true. What Oracle really cares about is block selectivity. That is, how many blocks does it have to visit, to satisfy the query? So, first, how "wide" is the RANGE SCAN? The more limited the range of values specified by the predicate values, the better. Second, when your query needs to do table lookups, how many different blocks will it have to visit to find all the data it needs. That is, how "random" is the data in the table relative to the index order? This is called the CLUSTERING_FACTOR. If you analyze the index to collect statistics, and then look at USER_INDEXES, you'll see that the CLUSTERING_FACTOR is now populated.

So, what's CLUSTERING_FACTOR? CLUSTERING_FACTOR is the "orderedness" of the table, with respect to the index's key column(s). The value of CLUSTERING_FACTOR will always be between the number of blocks in a table and the number of rows in a table. A low CLUSTERING_FACTOR, that is, one that is very near to the number of blocks in the table, indicates a table that's very ordered, relative to the index. A high CLUSTERING_FACTOR, that is, one that is very near to the number of rows in the table, is very unordered, relative to the index.

It's an important concept to understand that the CLUSTERING_FACTOR describes the order of data in the table relative to the index. So, rebuilding an index, for example, will not change the CLUSTERING_FACTOR. It's also important to understand that the same table could have two indexes, and one could have an excellent CLUSTERING_FACTOR, and the other could have an extremely poor CLUSTERING_FACTOR. The table itself can only be ordered in one way.

So, why have I spent so much time describing CLUSTERING_FACTOR? Because when you have an execution plan that does an INDEX RANGE SCAN followed by TABLE ACCESS BY ROWID, you can be sure that the CLUSTERING_FACTOR has been considered by Oracle's optimizer, to come up with the execution plan. For example, suppose you have a 10,000 row table, and suppose 100 of the rows have Occupation = 'DOCTOR'. You write the query above, asking for the Fname of the employees whose occupation is DOCTOR. Well, Oracle can very easily and efficiently determine the rowids of the rows where occupation is DOCTOR. But, how many table blocks will Oracle need to visit, to do the Fname lookup? It could be only 1 or 2 table blocks, if the data is clustered (ordered) by Occupation in the table. But, it could be as many as 100, if the data is very unordered in the table! So, again, 10,000 row table, and, let's assume, (for the purposes of illustration and simple math) that the table has 100 rows/block, and so, 100 blocks. Depending on table order (i.e. CLUSTERING_FACTOR), the number of table block visits could be as few as 1, or as many as 100.

So, I hope this helps you understand why the optimizer may be reluctant to use an index in some cases.

回复收藏 0 原文

陈独秀 2024-12-23 02:36:07

索引是表的副本，仅存储以下数据：

索引字段
指向原始行 (rowid) 的指针。

假设您有一个像这样的表：

rowid    id  name  occupation
[1]      1   John  clerk
[2]      2   Jim   manager
[3]      3   Jane  boss

那么职业上的索引将如下所示：

occupation  rowid
boss        [3]
manager     [2]
clerk       [1]

，其中记录在B树中按职业排序>。

如您所见，如果只选择索引字段，则只需要索引（第二个表）。

如果选择职业以外的任何内容：

SELECT  *
FROM    mytable
WHERE   occupation = 'clerk'

那么引擎应该做两件事：首先在索引中查找相关记录，第二，通过rowid在原始表中查找记录。就像您在 rowid 上连接两个表一样。

由于索引中的 rowid 不按顺序排列，因此对原始表的读取不是连续的并且可能会很慢。按顺序读取原始表并仅使用 occupation = 'clerk' 过滤记录可能会更快。

引擎不会“解锁”记录：它只是在索引中查找rowid，如果索引本身没有足够的数据，则通过找到的rowid在原始表中查找数据。

An index is the copy of the table which only stores the following data:

Indexed field(s)
A pointer to the original row (rowid).

Say you have a table like this:

rowid    id  name  occupation
[1]      1   John  clerk
[2]      2   Jim   manager
[3]      3   Jane  boss

Then an index on occupation would look like this:

occupation  rowid
boss        [3]
manager     [2]
clerk       [1]

, with the records sorted on occupation in a B-Tree.

As you can see, if you only select the indexed fields, you only need the index (the second table).

If you select anything other than occupation:

SELECT  *
FROM    mytable
WHERE   occupation = 'clerk'

then the engine should make two things: first find the relevant records in the index, second, find the records in the original table by rowid. It's like if you joined the two tables on rowid.

Since the rowids in the index are not in order, the reads to the original table are not sequential and can be slow. It may be faster to read the original table in sequential order and just filter the records with occupation = 'clerk'.

The engine does not "unlock" the records: it just finds the rowid in the index, and if there are not enough data in the index itself, it looks up data in the original table by the rowid found.

回复收藏 0 原文