How to optimize a SELECT across multiple tables with millions of rows

Published 2024-10-03 11:22:30


Have the following tables (Oracle 10g):

catalog (
  id NUMBER PRIMARY KEY,
  name VARCHAR2(255),
  owner NUMBER,
  root NUMBER REFERENCES catalog(id)
  ...
)
university (
  id NUMBER PRIMARY KEY,
  ...
)
securitygroup (
  id NUMBER PRIMARY KEY
  ...
)
catalog_securitygroup (
  catalog REFERENCES catalog(id),
  securitygroup REFERENCES securitygroup(id)
)
catalog_university (
  catalog REFERENCES catalog(id),
  university REFERENCES university(id)
)

Catalog: 500 000 rows, catalog_university: 500 000, catalog_securitygroup: 1 500 000.

I need to select any 50 rows from catalog with a specified root, ordered by name, for the current university and current security group. Here is the query:

SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c, catalog_securitygroup cs, catalog_university cu
        WHERE c.root = 100
          AND cs.catalog = c.id
          AND cs.securitygroup = 200
          AND cu.catalog = c.id
          AND cu.university = 300
        ORDER BY name
    ) cc 
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

Where 100 is some catalog, 200 some securitygroup, and 300 some university. This query returns 50 rows out of ~170,000 in 3 minutes.

But the next query returns these rows in 2 seconds:

SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c
        WHERE c.root = 100
        ORDER BY name
    ) cc 
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

I built the following indexes: (catalog.id, catalog.name, catalog.owner), (catalog_securitygroup.catalog, catalog_securitygroup.index), (catalog_university.catalog, catalog_university.university).

Plan for the first query (using PL/SQL Developer):

http://habreffect.ru/66c/f25faa5f8/plan2.jpg

Plan for the second query:

http://habreffect.ru/f91/86e780cc7/plan1.jpg

What are the ways to optimize the query I have?

Comments (5)

梓梦 2024-10-10 11:22:30

The indexes that can be useful and should be considered deal with

WHERE c.root = 100
      AND cs.catalog = c.id
      AND cs.securitygroup = 200
      AND cu.catalog = c.id
      AND cu.university = 300

So the following fields can be interesting for indexes

c: id, root   
cs: catalog, securitygroup   
cu: catalog, university

So, try creating

(catalog_securitygroup.catalog, catalog_securitygroup.securitygroup)

and

(catalog_university.catalog, catalog_university.university)

EDIT:
I missed the ORDER BY - these fields should also be considered, so

(catalog.name, catalog.id)

might be beneficial (or some other composite index that could be used for sorting and the conditions - possibly (catalog.root, catalog.name, catalog.id))
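As a sketch (the index names are my own), the suggestions above could be created like this:

```sql
-- Indexes covering the join/filter conditions
CREATE INDEX cs_cat_sg_ix ON catalog_securitygroup (catalog, securitygroup);
CREATE INDEX cu_cat_univ_ix ON catalog_university (catalog, university);

-- Composite index usable for both the root filter and the ORDER BY name,
-- letting the database read the rows for one root already sorted by name
CREATE INDEX cat_root_name_ix ON catalog (root, name, id);
```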

EDIT2
Although another answer has been accepted, I'll provide some more food for thought.
I created some test data and ran some benchmarks.

The test cases are minimal in terms of record width (in catalog_securitygroup and catalog_university the primary keys are (catalog, securitygroup) and (catalog, university)). Here is the number of records per table:

test=# SELECT (SELECT COUNT(*) FROM catalog), (SELECT COUNT(*) FROM catalog_securitygroup), (SELECT COUNT(*) FROM catalog_university);
 ?column? | ?column? | ?column? 
----------+----------+----------
   500000 |  1497501 |   500000
(1 row)

The database is PostgreSQL 8.4, a default Ubuntu install; hardware: i5, 4 GB RAM.

First I rewrote the query to

SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root < 50 
  AND cs.catalog = c.id 
  AND cu.catalog = c.id
  AND cs.securitygroup < 200
  AND cu.university < 200
ORDER BY c.name
LIMIT 50 OFFSET 100

note: the conditions were turned into less-than comparisons to maintain a comparable number of intermediate rows (the above query returns 198,801 rows without the LIMIT clause)

Run as above, without any extra indexes (save for PKs and foreign keys), it takes 556 ms on a cold database (which actually indicates that I oversimplified the sample data somehow - I would be happier if I had 2-4 s here without resorting to less-than operators)

This brings me to my point - any straight query that only joins and filters a certain number of tables and returns only a certain number of records should run in under 1 s on any decent database, without resorting to cursors or to denormalizing data (one of these days I'll have to write a post on that).

Furthermore, if a query returns only 50 rows and does simple equality joins with restrictive equality conditions, it should run much faster still.

Now let's see what happens if I add some indexes. The biggest potential in queries like this is usually the sort order, so let me try that:

CREATE INDEX test1 ON catalog (name, id);

This brings the execution time of the query down to 22 ms on a cold database.

And that's the point - if you are trying to get only a page of data, you should fetch only a page of data; queries like this on normalized data with proper indexes should take less than 100 ms on decent hardware.

I hope I didn't oversimplify the case to the point of no comparison (as I stated before, some simplification is present, since I don't know the cardinality of the relationships between catalog and the many-to-many tables).

So, the conclusion is

  • If I were you, I would not stop tweaking the indexes (and the SQL) until the query's performance goes below 200 ms, as a rule of thumb.
  • Only if I found an objective explanation of why it can't go below that value would I resort to denormalization and/or cursors, etc.
梦途 2024-10-10 11:22:30


First, I assume that your university and securitygroup tables are rather small. You posted the sizes of the large tables, but it's really the other sizes that are part of the problem.

Your problem comes from the fact that you can't join the smallest tables first. Your join order should go from small to large. But because your mapping tables don't include a securitygroup-to-university table, you can't join the smallest ones first. So you wind up starting with one mapping table or the other, joining to a big table, then to another big table, and only then, with that large intermediate result, do you get to a small table.

If you always have current_univ, current_secgrp, and root as inputs, you want to use them to filter as soon as possible. The only way to do that is to change your schema somewhat. In fact, you can leave the existing tables in place if you have to, but you'll be adding space with this suggestion.

You've normalized the data very well. That's great for update speed... not so great for querying. We denormalize to speed up querying (that's the whole reason for data warehouses - OK, that and history). Build a single mapping table with the following columns:

Univ_id, SecGrp_ID, Root, catalog_id. Make it an index-organized table with the first three columns as the PK.

Now when you probe that index with all three PK values, the index scan finishes with the complete list of allowable catalog IDs; from there it's just a single join to the catalog table to get the item details, and you're off and running.
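A sketch of that mapping table in Oracle DDL (the names are hypothetical; note that the primary key has to include catalog_id as well, since (univ_id, secgrp_id, root) alone would allow only one catalog per combination):

```sql
-- Hypothetical denormalized mapping table as an index-organized table.
-- catalog_id is part of the PK so that one (univ, secgrp, root)
-- combination can map to many catalogs.
CREATE TABLE catalog_access (
  univ_id    NUMBER NOT NULL,
  secgrp_id  NUMBER NOT NULL,
  root       NUMBER NOT NULL,
  catalog_id NUMBER NOT NULL,
  CONSTRAINT catalog_access_pk
    PRIMARY KEY (univ_id, secgrp_id, root, catalog_id)
) ORGANIZATION INDEX;

-- The paged query then becomes a range scan on the IOT plus one join:
SELECT cc.* FROM (
  SELECT c.id, c.name, c.owner
    FROM catalog_access ca, catalog c
   WHERE ca.univ_id = 300
     AND ca.secgrp_id = 200
     AND ca.root = 100
     AND c.id = ca.catalog_id
   ORDER BY c.name
) cc WHERE ROWNUM <= 50;
```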

伏妖词 2024-10-10 11:22:30


The Oracle cost-based optimizer makes use of all the information that it has to decide what the best access paths are for the data and what the least costly methods are for getting that data. So below are some random points related to your question.

The first three tables that you've listed all have primary keys. Do the other tables (catalog_university and catalog_securitygroup) also have primary keys on them? A primary key defines a column or set of columns that are non-null and unique, and is very important in a relational database.

Oracle generally enforces a primary key by generating a unique index on the given columns. The Oracle optimizer is more likely to make use of a unique index if it is available, as it is likely to be more selective.

If possible, an index that contains unique values should be defined as unique (CREATE UNIQUE INDEX ...); this provides the optimizer with more information.
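For example, assuming each (catalog, securitygroup) and (catalog, university) pair appears at most once (an assumption - the question doesn't say), declaring that explicitly could look like:

```sql
-- Hypothetical: declare the mapping pairs unique if they are in fact unique,
-- so the optimizer knows each probe returns at most one row.
CREATE UNIQUE INDEX cs_cat_sg_uix
  ON catalog_securitygroup (catalog, securitygroup);

CREATE UNIQUE INDEX cu_cat_univ_uix
  ON catalog_university (catalog, university);
```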

The additional indexes that you have created are no more selective than the existing ones. For example, the index on (catalog.id, catalog.name, catalog.owner) is unique but less useful than the existing primary key index on (catalog.id). If a query selects on the catalog.name column, an index skip scan is possible, but it starts being costly (and may not even be possible in this case).

Since you are trying to select based on the catalog.root column, it might be worth adding an index on that column. That would mean the relevant rows can be found quickly in the catalog table. The timing of the second query could be a bit misleading: it might take 2 seconds to find 50 matching rows from catalog, but those could easily be the first 50 rows of the catalog table... finding 50 rows that match all your conditions might take much longer, and not just because you need to join to other tables to get them. When performance tuning, I would always use CREATE TABLE AS SELECT without restricting on ROWNUM. With a complex query I generally care about how long it takes to get all the rows back... a simple select with ROWNUM can be misleading.

Everything about Oracle performance tuning is about providing the optimizer enough information and the right tools (indexes, constraints, etc) to do its job properly. For this reason it's important to get optimizer statistics using something like DBMS_STATS.GATHER_TABLE_STATS(). Indexes should have stats gathered automatically in Oracle 10g or later.
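A minimal sketch of gathering those statistics (the table name is taken from the question; cascade => TRUE refreshes the index statistics as well):

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'CATALOG',
    cascade => TRUE
  );
END;
/
```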

Somehow this grew into quite a long answer about the Oracle optimizer. Hopefully some of it answers your question. Here is a summary of what is said above:

  • Give the optimizer as much information as possible, e.g. if an index is unique then declare it as such.
  • Add indexes on your access paths.
  • Find the correct timings for queries without limiting by ROWNUM. It will always be quicker to find the first 50 M&Ms in a jar than to find the first 50 red M&Ms.
  • Gather optimizer stats.
  • Add unique/primary keys on all tables where they exist.
浅听莫相离 2024-10-10 11:22:30


The use of ROWNUM here is wrong and causes all the rows to be processed: it will process every row, assign each a row number, and then find those between 0 and 50. What you want to look for in the explain plan is COUNT STOPKEY rather than just COUNT.

The query below should be an improvement, as it will only fetch the first 50 rows... but there is still the issue of the joins to look at:

SELECT ccc.* FROM (
  SELECT cc.*, ROWNUM AS n FROM (
      SELECT c.id, c.name, c.owner
        FROM catalog c
        WHERE c.root = 100
        ORDER BY name
    ) cc 
    where rownum <= 50
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;

Also, assuming this is for a web page or something similar, maybe there is a better way to handle paging than just running the query again to get the data for the next page.
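One such approach (a sketch of my own, not from the answer) is keyset or "seek" pagination: remember the last (name, id) of the current page and start the next page after it, instead of renumbering all rows on every request:

```sql
-- Hypothetical bind variables :last_name and :last_id hold the last row
-- of the previous page; (name, id) makes the sort key deterministic.
SELECT * FROM (
  SELECT c.id, c.name, c.owner
    FROM catalog c
   WHERE c.root = 100
     AND (c.name > :last_name
          OR (c.name = :last_name AND c.id > :last_id))
   ORDER BY c.name, c.id
) WHERE ROWNUM <= 50;
```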

别闹i 2024-10-10 11:22:30


Try declaring a cursor. I don't know Oracle, but in SQL Server it would look like this:

declare @result 
table ( 
    id numeric,
    name varchar(255)
); 

declare __dyn_select_cursor cursor LOCAL SCROLL DYNAMIC for 

--Select
select distinct 
    c.id, c.name
From [catalog] c
    inner join university u
    on     u.catalog = c.id
       and u.university = 300
    inner join catalog_securitygroup s
    on     s.catalog = c.id
       and s.securitygroup = 200
Where
    c.root = 100
Order by name   

--Cursor
declare @id numeric;
declare @name varchar(255);

open __dyn_select_cursor; 

fetch relative 1 from __dyn_select_cursor into @id, @name

declare @maxrowscount int

set @maxrowscount = 50

while (@@fetch_status = 0 and @maxrowscount <> 0) 
begin 
     insert into @result values (@id, @name);
     set @maxrowscount = @maxrowscount - 1;
     fetch next from __dyn_select_cursor into  @id, @name; 
end 
close __dyn_select_cursor; 
deallocate __dyn_select_cursor; 


--Select temp, final result
select 
 id, 
 name
from @result; 