PostgreSQL 多列索引连接与比较(“<”和“>”)运算符
我正在尝试利用 PostgreSQL 中的多列 btree 索引在两个表之间执行烦人的联接。
Table "revision_main"
Column | Type | Modifiers
----------------+------------------------+-----------
revision_id | integer |
page_id | integer |
Indexes:
"revision_main_pkey" UNIQUE, btree (revision_id)
"revision_main_cluster_idx" btree (page_id, "timestamp") CLUSTER
该表包含对 wiki 中页面的修订(约 3 亿行)。我的表中有更多列,但我在本示例中放弃了它们,因为它们不重要。
Table "revert"
Column | Type | Modifiers
--------------------+---------+-----------
page_id | integer |
revision_id | integer |
reverted_to | integer |
Indexes:
"revert_page_between_idx" btree (page_id, reverted_to, revision_id) CLUSTER
该表包含恢复修订(约 2200 万行)。如果修订已被还原,则该 revision_id 将在 revision_main 表中具有一行,并且其 revision_id 将位于 reverted_to 和 revision_id 之间,并共享相同的 page_id。 (请参阅http://en.wikipedia.org/wiki/Wikipedia:Revert,如果您很好奇。)
连接这两个表来获取恢复的修订版本似乎很简单。以下是我的想法:
explain SELECT
r.revision_id,
rvt.revision_id
FROM revision_main r
INNER JOIN revert rvt
ON r.page_id = rvt.page_id
AND r.revision_id > rvt.reverted_to
AND r.revision_id < rvt.revision_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Merge Join (cost=4202878.87..15927491478.57 rows=88418194298 width=8)
Merge Cond: (r.page_id = rvt.page_id)
Join Filter: ((r.revision_id > rvt.reverted_to) AND (r.revision_id < rvt.revision_id))
-> Index Scan using revision_main_page_id_idx on revision_main r (cost=0.00..9740790.61 rows=223163392 width=8)
-> Materialize (cost=4201592.06..4536465.21 rows=26789852 width=12)
-> Sort (cost=4201592.06..4268566.69 rows=26789852 width=12)
Sort Key: rvt.page_id
-> Seq Scan on revert rvt (cost=0.00..438534.52 rows=26789852 width=12)
尽管恢复时的聚集索引应该是 Btree 索引(因此支持“<”和“>”等比较运算符),但查询优化器不会将该索引用于join 和“explain”预计总成本将超过 150 亿(可能明年完成)。
比较运算符是否无法与多列(btree)索引一起使用?我只是做错了吗?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
看起来优化器比你更了解它的工作。
如果您选择的不仅仅是表的一小部分(哪个部分取决于硬件,假设为 5%),那么选择和排序整个表比使用索引更快。如果您只是选择几行,那么它应该使用索引。因此它为您的数据提供了正确的查询计划。
至于总成本,这些数字都是废话,只有在单个查询中相互比较时才有用。 (两个非常相似的查询产生的总成本可能有很大不同。)执行时间和查询成本几乎不相关。
It looks like the optimizer knows its job better than you do.
If you are selecting more than a small fraction of a table (what fraction is hardware dependent, let's say 5%), then it is faster to select and order the whole table than it is to use an index. If you were just selecting a few rows, then it should use the index. So it is giving you the correct query plan for your data.
As for the total cost, those numbers are all BS and are only useful when compared in relation to each other, within a single query. (The total costs produced by two very similar queries can be on a very different scale.) The time to execute and the query cost are pretty much unrelated.
您的查询(基于 SQL)看起来需要读取整个恢复表,并为恢复表中的每一行找到适当的修订行。
由于需要读取整个恢复表,因此对其进行顺序扫描是合适的。它似乎期望大致正确的行数。
然后,每个恢复行将匹配多个修订,它认为最好通过索引扫描和合并联接来完成。据估计,平均每个恢复行将匹配大约 3300 个修订,从而产生 880 亿行。
我不知道有什么方法可以快速选择 880 亿行。
为了获得更准确的估计,您需要一种方法来说服 PostgreSQL 每个恢复所涵盖的修订版远少于 3300 个。
您说您正在恢复修订版本,这表明每个修订版本只应出现一次,即使包含在多个恢复版本中也是如此。
因此,请尝试使用
EXISTS(子查询)
而不是INNER JOIN
但这不会为您提供恢复修订:
Your query (based on the SQL) looks like it needs to read the entire revert table, and find the appropriate revision rows for each row in the revert table.
Since the entire revert table needs to be read, a sequential scan of it is appropriate. It seems to expect roughly the right number of rows.
Each revert row is then going to match a number of revisions, which it thinks will be best done through an index scan and merge join. It estimates that on average, each revert row will match roughly 3300 revisions, resulting in 88 billion rows.
I don't know of any ways to select 88 billion rows quickly.
In order to get a more accurate estimate, you'll need a way of convincing PostgreSQL that there are a lot less than 3300 revisions covered by each revert.
You say that you are after reverted revisions, indicating that each revision should appear only once, even if included within multiple reverts.
So try using an
EXISTS (subquery)
instead of anINNER JOIN
This won't give you the revert revisions though: