PostgreSQL 多列索引连接与比较(“<”和“>”)运算符

发布于 2024-10-16 02:00:25 字数 2355 浏览 0 评论 0 原文

我正在尝试利用 PostgreSQL 中的多列 btree 索引在两个表之间执行烦人的联接。

               Table "revision_main"
     Column     |          Type          | Modifiers 
----------------+------------------------+-----------
 revision_id    | integer                | 
 page_id        | integer                | 

Indexes:
    "revision_main_pkey" UNIQUE, btree (revision_id)
    "revision_main_cluster_idx" btree (page_id, "timestamp") CLUSTER

该表包含对 wiki 中页面的修订(约 3 亿行)。我的表中有更多列,但我在本示例中放弃了它们,因为它们不重要。

               Table "revert"
       Column       |  Type   | Modifiers 
--------------------+---------+-----------
 page_id            | integer | 
 revision_id        | integer | 
 reverted_to        | integer | 
Indexes:
    "revert_page_between_idx" btree (page_id, reverted_to, revision_id) CLUSTER

该表包含恢复修订(约 2200 万行)。如果修订已被还原,则该 revision_id 将在 revision_main 表中具有一行,并且其 revision_id 将位于 reverted_to 和 revision_id 之间,并共享相同的 page_id。 (请参阅http://en.wikipedia.org/wiki/Wikipedia:Revert,如果您很好奇。)

连接这两个表来获取恢复的修订版本似乎很简单。以下是我的想法:

explain SELECT
    r.revision_id,
    rvt.revision_id
FROM revision_main r
INNER JOIN revert rvt 
    ON r.page_id = rvt.page_id 
    AND r.revision_id > rvt.reverted_to
    AND r.revision_id < rvt.revision_id;
                                       QUERY PLAN                                               
----------------------------------------------------------------------------------------------------
 Merge Join  (cost=4202878.87..15927491478.57 rows=88418194298 width=8)
   Merge Cond: (r.page_id = rvt.page_id)
   Join Filter: ((r.revision_id > rvt.reverted_to) AND (r.revision_id < rvt.revision_id))
   ->  Index Scan using revision_main_page_id_idx on revision_main r  (cost=0.00..9740790.61 rows=223163392 width=8)
   ->  Materialize  (cost=4201592.06..4536465.21 rows=26789852 width=12)
         ->  Sort  (cost=4201592.06..4268566.69 rows=26789852 width=12)
               Sort Key: rvt.page_id
               ->  Seq Scan on revert rvt  (cost=0.00..438534.52 rows=26789852 width=12)

尽管恢复时的聚集索引应该是 Btree 索引(因此支持“<”和“>”等比较运算符),但查询优化器不会将该索引用于join 和“explain”预计总成本将超过 150 亿(可能明年完成)。

比较运算符是否无法与多列(btree)索引一起使用?我只是做错了吗?

I'm trying to take advantage of a multi-column btree index in PostgreSQL to perform an annoying join between two tables.

               Table "revision_main"
     Column     |          Type          | Modifiers 
----------------+------------------------+-----------
 revision_id    | integer                | 
 page_id        | integer                | 

Indexes:
    "revision_main_pkey" UNIQUE, btree (revision_id)
    "revision_main_cluster_idx" btree (page_id, "timestamp") CLUSTER

This table contains revisions (~ 300 Million rows) to pages in a wiki. There are more columns in my table, but I've discarded them for this example because they shouldn't matter.

               Table "revert"
       Column       |  Type   | Modifiers 
--------------------+---------+-----------
 page_id            | integer | 
 revision_id        | integer | 
 reverted_to        | integer | 
Indexes:
    "revert_page_between_idx" btree (page_id, reverted_to, revision_id) CLUSTER

This table contains reverting revisions (~22 Million rows). If a revisions has been reverted, that revision_id will have a row in the revision_main table and its revision_id will be between reverted_to and revision_id as well as share the same page_id. (See http://en.wikipedia.org/wiki/Wikipedia:Revert if you are curious.)

Joining these two tables to get reverted revisions seems straightforward. Here is what I've come up with:

explain SELECT
    r.revision_id,
    rvt.revision_id
FROM revision_main r
INNER JOIN revert rvt 
    ON r.page_id = rvt.page_id 
    AND r.revision_id > rvt.reverted_to
    AND r.revision_id < rvt.revision_id;
                                       QUERY PLAN                                               
----------------------------------------------------------------------------------------------------
 Merge Join  (cost=4202878.87..15927491478.57 rows=88418194298 width=8)
   Merge Cond: (r.page_id = rvt.page_id)
   Join Filter: ((r.revision_id > rvt.reverted_to) AND (r.revision_id < rvt.revision_id))
   ->  Index Scan using revision_main_page_id_idx on revision_main r  (cost=0.00..9740790.61 rows=223163392 width=8)
   ->  Materialize  (cost=4201592.06..4536465.21 rows=26789852 width=12)
         ->  Sort  (cost=4201592.06..4268566.69 rows=26789852 width=12)
               Sort Key: rvt.page_id
               ->  Seq Scan on revert rvt  (cost=0.00..438534.52 rows=26789852 width=12)

Even though the clustered index on revert should be a Btree index (and thus support comparison operators like "<" and ">"), the query optimizer does not use the index for the join and "explain" predicts a total cost of over 15 billion (might be done next year).

Are comparison operators impossible to use with multi-column (btree) indexes? Am I just doing it wrong?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

祁梦 2024-10-23 02:00:25

看起来优化器比你更了解它的工作。

如果您选择的不仅仅是表的一小部分(哪个部分取决于硬件,假设为 5%),那么选择和排序整个表比使用索引更快。如果您只是选择几行,那么它应该使用索引。因此它为您的数据提供了正确的查询计划。

至于总成本,这些数字都是废话,只有在单个查询中相互比较时才有用。 (两个非常相似的查询产生的总成本可能有很大不同。)执行时间和查询成本几乎不相关。

It looks like the optimizer knows its job better than you do.

If you are selecting more than a small fraction of a table (what fraction is hardware dependent, let's say 5%), then it is faster to select and order the whole table than it is to use an index. If you were just selecting a few rows, then it should use the index. So it is giving you the correct query plan for your data.

As for the total cost, those numbers are all BS and are only useful when compared in relation to each other, within a single query. (The total costs produced by two very similar queries can be on a very different scale.) The time to execute and the query cost are pretty much unrelated.

尹雨沫 2024-10-23 02:00:25

您的查询(基于 SQL)看起来需要读取整个恢复表,并为恢复表中的每一行找到适当的修订行。

由于需要读取整个恢复表,因此对其进行顺序扫描是合适的。它似乎期望大致正确的行数。

然后,每个恢复行将匹配多个修订,它认为最好通过索引扫描和合并联接来完成。据估计,平均每个恢复行将匹配大约 3300 个修订,从而产生 880 亿行。

我不知道有什么方法可以快速选择 880 亿行。

为了获得更准确的估计,您需要一种方法来说服 PostgreSQL 每个恢复所涵盖的修订版远少于 3300 个。

您说您正在恢复修订版本,这表明每个修订版本只应出现一次,即使包含在多个恢复版本中也是如此。

因此,请尝试使用 EXISTS(子查询) 而不是 INNER JOIN

但这不会为您提供恢复修订:

EXPLAIN
SELECT
    r.revision_id
FROM revision_main r
WHERE EXISTS (SELECT 1 FROM revert rvt 
    WHERE r.page_id = rvt.page_id 
    AND r.revision_id > rvt.reverted_to
    AND r.revision_id < rvt.revision_id);

Your query (based on the SQL) looks like it needs to read the entire revert table, and find the appropriate revision rows for each row in the revert table.

Since the entire revert table needs to be read, a sequential scan of it is appropriate. It seems to expect roughly the right number of rows.

Each revert row is then going to match a number of revisions, which it thinks will be best done through an index scan and merge join. It estimates that on average, each revert row will match roughly 3300 revisions, resulting in 88 billion rows.

I don't know of any ways to select 88 billion rows quickly.

In order to get a more accurate estimate, you'll need a way of convincing PostgreSQL that there are a lot less than 3300 revisions covered by each revert.

You say that you are after reverted revisions, indicating that each revision should appear only once, even if included within multiple reverts.

So try using an EXISTS (subquery) instead of an INNER JOIN

This won't give you the revert revisions though:

EXPLAIN
SELECT
    r.revision_id
FROM revision_main r
WHERE EXISTS (SELECT 1 FROM revert rvt 
    WHERE r.page_id = rvt.page_id 
    AND r.revision_id > rvt.reverted_to
    AND r.revision_id < rvt.revision_id);
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文